NFDI4DS | UHH-SEMS - Publication Details

Kai North

ORCID: 0000-0002-9970-2402

About

Contact & Profiles

A5002957144

Research Areas

Natural Language Processing Techniques
Text Readability and Simplification
Topic Modeling
Hate Speech and Cyberbullying Detection
Authorship Attribution and Profiling
Interpreting and Communication in Healthcare
Swearing, Euphemism, Multilingualism
Names, Identity, and Discrimination Research
Cybercrime and Law Enforcement Studies
Reinforcement Learning in Robotics
Health Literacy and Information Accessibility
Linguistics, Language Diversity, and Identity
Second Language Acquisition and Learning
Bullying, Victimization, and Aggression
Translation Studies and Practices
Advanced Malware Detection Techniques

George Mason University
2022-2025

Rochester Institute of Technology
2021-2022

Findings of the VarDial Evaluation Campaign 2023

OPENALEX - Publications

Noëmi Aepli Çağrı Çöltekin Rob van der Goot Tommi Jauhiainen Mourhaf Kazzaz and 5 more

Noëmi Aepli, Çağrı Çöltekin, Rob van der Goot, Tommi Jauhiainen, Mourhaf Kazzaz, Nikola Ljubešić, Kai North, Barbara Plank, Yves Scherrer, Marcos Zampieri. Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023). 2023.

10.18653/v1/2023.vardial-1.25 article EN cc-by 2023-01-01

Overview of the HASOC Subtrack at FIRE 2022: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages

Shrey Satapara Prasenjit Majumder Thomas Mandl Sandip Modha Hiren Madhu and 4 more

In recent years, the spread of online offensive content has become a great concern, motivating researchers to develop robust systems capable of identifying such content automatically. To carry out a fair evaluation of these systems, several international shared tasks have been organized, providing the community with essential benchmark data and evaluation methods for various languages. Organized since 2019, the HASOC (Hate Speech and Offensive Content Identification) shared task is one of these initiatives. In its fourth iteration, HASOC 2022 included...

10.1145/3574318.3574326 article EN 2022-12-09

The Vulnerability of AI‐Based Scoring Systems to Gaming Strategies: A Case Study

Peter Baldwin Victoria Yaneva Kai North Le An Ha Yiyun Zhou and 2 more

Recent developments in the use of large‐language models have led to substantial improvements in the accuracy of content‐based automated scoring of free‐text responses. The reported accuracy levels suggest that automated systems could have widespread applicability in assessment. However, before they are used in operational testing, other aspects of their performance warrant examination. In this study, we explore the potential for examinees to inflate their scores by gaming the ACTA system. We explore a range of strategies including responding with words...

10.1111/jedm.12427 article EN Journal of Educational Measurement 2025-02-20

Lexical simplification benchmarks for English, Portuguese, and Spanish

Sanja Štajner Daniel Ferrés Matthew Shardlow Kai North Marcos Zampieri and 1 more

Even in highly-developed countries, as many as 15-30% of the population can only understand texts written using a basic vocabulary. Their understanding of everyday texts is limited, which prevents them from taking an active role in society and making informed decisions regarding healthcare, legal representation, or democratic choice. Lexical simplification is a natural language processing task that aims to make text understandable to everyone by replacing complex vocabulary and expressions with simpler ones, while...

10.3389/frai.2022.991242 article EN cc-by Frontiers in Artificial Intelligence 2022-09-22

Findings of the TSAR-2022 Shared Task on Multilingual Lexical Simplification

Horacio Saggion Sanja Štajner Daniel Ferrés Kim Cheng Sheang Matthew Shardlow and 2 more

Horacio Saggion, Sanja Štajner, Daniel Ferrés, Kim Cheng Sheang, Matthew Shardlow, Kai North, Marcos Zampieri. Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022). 2022.

10.18653/v1/2022.tsar-1.31 article EN cc-by 2022-01-01

Deep learning approaches to lexical simplification: A survey

Kai North Tharindu Ranasinghe Matthew Shardlow Marcos Zampieri

Lexical Simplification (LS) is the task of substituting complex words within a sentence for simpler alternatives while maintaining the sentence’s original meaning. LS is the lexical component of Text Simplification (TS) systems with the aim of improving accessibility to various target populations such as individuals with low literacy or reading disabilities. Prior surveys have been published several years before the introduction of transformers, transformer-based large language models (LLMs), and prompt learning that drastically...

10.1007/s10844-024-00882-9 article EN cc-by Journal of Intelligent Information Systems 2024-09-02

Language Variety Identification with True Labels

Marcos Zampieri Kai North Tommi Jauhiainen Mariano Felice Neha Kumari and 2 more

Language identification is an important first step in many IR and NLP applications. Most publicly available language identification datasets, however, are compiled under the assumption that the gold label of each instance is determined by where texts are retrieved from. Research has shown that this is a problematic assumption, particularly in the case of very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where texts may contain no distinctive marker of the particular language or variety. To overcome this limitation,...

10.48550/arxiv.2303.01490 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

Target-Based Offensive Language Identification

Marcos Zampieri Skye Morgan Kai North Tharindu Ranasinghe Austin Simmons and 3 more

Marcos Zampieri, Skye Morgan, Kai North, Tharindu Ranasinghe, Austin Simmons, Paridhi Khandelwal, Sara Rosenthal, Preslav Nakov. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023.

10.18653/v1/2023.acl-short.66 article EN cc-by 2023-01-01

Health text simplification: An annotated corpus for digestive cancer education and novel strategies for reinforcement learning

Md. Mushfiqur Rahman Mohammad Sabik Irbaz Kai North Michelle S. Williams Marcos Zampieri and 1 more

10.1016/j.jbi.2024.104727 article EN Journal of Biomedical Informatics 2024-09-16

MultiLS: An End-to-End Lexical Simplification Framework

Kai North Tharindu Ranasinghe Matthew Shardlow Marcos Zampieri

10.18653/v1/2024.tsar-1.1 article EN 2024-01-01

Features of lexical complexity: insights from L1 and L2 speakers

Kai North Marcos Zampieri

We discover sizable differences between the lexical complexity assignments of first language (L1) and second language (L2) English speakers. The complexity assignments of 940 shared tokens without context were extracted and compared from three lexical complexity prediction (LCP) datasets: the CompLex dataset, the Word Complexity Lexicon, and the CEFR-J wordlist. It was found that word frequency, length, syllable count, familiarity, and prevalence, as well as a number of derivations, had a greater effect on perceived lexical complexity for L2 English speakers than they did for L1 English speakers. We explain these findings in...

10.3389/frai.2023.1236963 article EN cc-by Frontiers in Artificial Intelligence 2023-11-30

LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction

Abhinandan Desai Kai North Marcos Zampieri Christopher M. Homan

This paper describes team LCP-RIT’s submission to the SemEval-2021 Task 1: Lexical Complexity Prediction (LCP). The task organizers provided participants with an augmented version of CompLex (Shardlow et al., 2020), an English multi-domain dataset in which words in context were annotated with respect to their complexity using a five point Likert scale. Our system uses logistic regression and a wide range of linguistic features (e.g. psycholinguistic features, n-grams, word frequency, POS tags) to predict the complexity of single...

10.18653/v1/2021.semeval-1.67 article EN cc-by Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021) 2021-01-01

ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification

Kai North Marcos Zampieri Tharindu Ranasinghe

Lexical simplification (LS) is the task of automatically replacing complex words for easier ones, making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605...

10.48550/arxiv.2209.09034 preprint EN cc-by-nc-sa arXiv (Cornell University) 2022-01-01

GMU-WLV at TSAR-2022 Shared Task: Evaluating Lexical Simplification Models

Kai North Alphaeus Dmonte Tharindu Ranasinghe Marcos Zampieri

This paper describes team GMU-WLV's submission to the TSAR shared task on multilingual lexical simplification. The goal of the task is to automatically provide a set of candidate substitutions for complex words in context. The organizers provided participants with ALEXSIS, a manually annotated dataset with instances split between a small trial set with a dozen instances in each of the three languages of the competition (English, Portuguese, Spanish) and a test set with over 300 instances in the three aforementioned languages. To cope with the lack of training data, participants had to either use alternative data...

10.18653/v1/2022.tsar-1.30 article EN cc-by 2022-01-01

An Evaluation of Binary Comparative Lexical Complexity Models

Kai North Marcos Zampieri Matthew Shardlow

Identifying complex words in texts is an important first step in text simplification (TS) systems. In this paper, we investigate the performance of binary comparative Lexical Complexity Prediction (LCP) models applied to a popular benchmark dataset, CompLex 2.0, used in SemEval-2021 Task 1. With the data from CompLex 2.0, we create a new dataset containing 1,940 sentences, referred to as CompLex-BC. Using CompLex-BC, we train multiple models to differentiate which of two target words is more or less complex in the same sentence. A linear SVM model achieved the best performance in our...

10.18653/v1/2022.bea-1.24 article EN cc-by 2022-01-01

Deep Learning Approaches to Lexical Simplification: A Survey

Kai North Tharindu Ranasinghe Matthew Shardlow Marcos Zampieri

Lexical Simplification (LS) is the task of replacing complex words with simpler words in a sentence whilst preserving the sentence's original meaning. LS is the lexical component of Text Simplification (TS) with the aim of making texts more accessible to various target populations. A past survey (Paetzold and Specia, 2017) has provided a detailed overview of LS. Since this survey, however, the AI/NLP community has been taken by storm by recent advances in deep learning, particularly with the introduction of large language models (LLM) and prompt learning. The high...

10.48550/arxiv.2305.12000 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

ALEXSIS+: Improving Substitute Generation and Selection for Lexical Simplification with Information Retrieval

Kai North Alphaeus Dmonte Tharindu Ranasinghe Matthew Shardlow Marcos Zampieri

Lexical simplification (LS) automatically replaces words that are deemed difficult to understand for a given target population with simpler alternatives, whilst preserving the meaning of the original sentence. The TSAR-2022 shared task on LS provided participants with a multilingual lexical simplification test set. It contained nearly 1,200 complex words in English, Portuguese, and Spanish and presented multiple candidate substitutions for each complex word. The competition did not make training data available; therefore, teams had to use either...

10.18653/v1/2023.bea-1.33 article EN cc-by 2023-01-01

Health Text Simplification: An Annotated Corpus for Digestive Cancer Education and Novel Strategies for Reinforcement Learning

Md. Mushfiqur Rahman Mohammad Sabik Irbaz Kai North Michelle S. Williams Marcos Zampieri and 1 more

Objective: The reading level of health educational materials significantly influences the understandability and accessibility of the information, particularly for minoritized populations. Many patient educational resources surpass the reading level and complexity of widely accepted standards. There is a critical need for high-performing text simplification models in health information to enhance dissemination and literacy. This need is particularly acute in cancer education, where effective prevention and screening education can substantially reduce morbidity and mortality. Methods: We...

10.48550/arxiv.2401.15043 preprint EN arXiv (Cornell University) 2024-01-26

MultiLS: A Multi-task Lexical Simplification Framework

Kai North Tharindu Ranasinghe Matthew Shardlow Marcos Zampieri

Lexical Simplification (LS) automatically replaces difficult to read words for easier alternatives while preserving a sentence's original meaning. LS is a precursor to Text Simplification with the aim of improving text accessibility for various target demographics, including children, second language learners, and individuals with reading disabilities or low literacy. Several datasets exist for LS. These datasets specialize in one or two sub-tasks within the LS pipeline. However, as of this moment, no single LS dataset has been developed that covers...

10.48550/arxiv.2402.14972 preprint EN arXiv (Cornell University) 2024-02-22

Native Language Identification in Texts: A Survey

Dhiman Goswami Sharanya Thilagan Kai North Shervin Malmasi Marcos Zampieri

10.18653/v1/2024.naacl-long.173 article EN 2024-01-01

Overview of the HASOC Subtrack at FIRE 2022: Offensive Language Identification in Marathi

Tharindu Ranasinghe Kai North Damith Premasiri Marcos Zampieri

The spread of offensive content online has become a reason for great concern in recent years, motivating researchers to develop robust systems capable of identifying such content automatically. With the goal of carrying out a fair evaluation of these systems, several international competitions have been organized, providing the community with important benchmark data and evaluation methods for various languages. Organized since 2019, the HASOC (Hate Speech and Offensive Content Identification) shared task is one of these initiatives. In its...

10.48550/arxiv.2211.10163 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Findings of the TSAR-2022 Shared Task on Multilingual Lexical Simplification

Horacio Saggion Sanja Štajner Daniel Ferrés Kim Cheng Sheang Matthew Shardlow and 2 more

We report findings of the TSAR-2022 shared task on multilingual lexical simplification, organized as part of the Workshop on Text Simplification, Accessibility, and Readability held in conjunction with EMNLP 2022. The task called the Natural Language Processing research community to contribute methods to advance the state of the art in multilingual lexical simplification for English, Portuguese, and Spanish. A total of 14 teams submitted the results of their lexical simplification systems for the provided test data. Results indicate new benchmarks in Lexical Simplification with English quantitative...

10.48550/arxiv.2302.02888 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

Findings of the VarDial Evaluation Campaign 2023

Noëmi Aepli Çağrı Çöltekin Rob van der Goot Tommi Jauhiainen Mourhaf Kazzaz and 5 more

This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR), Discriminating Between Similar Languages -- True Labels (DSL-TL), and Discriminating Between Similar Languages -- Speech (DSL-S). All three tasks were organized for the first time this year.

10.48550/arxiv.2305.20080 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01
