Kai North

ORCID: 0000-0002-9970-2402
Research Areas
  • Natural Language Processing Techniques
  • Text Readability and Simplification
  • Topic Modeling
  • Hate Speech and Cyberbullying Detection
  • Authorship Attribution and Profiling
  • Interpreting and Communication in Healthcare
  • Swearing, Euphemism, Multilingualism
  • Names, Identity, and Discrimination Research
  • Cybercrime and Law Enforcement Studies
  • Reinforcement Learning in Robotics
  • Health Literacy and Information Accessibility
  • Linguistics, Language Diversity, and Identity
  • Second Language Acquisition and Learning
  • Bullying, Victimization, and Aggression
  • Translation Studies and Practices
  • Advanced Malware Detection Techniques

George Mason University
2022-2025

Rochester Institute of Technology
2021-2022

Noëmi Aepli, Çağrı Çöltekin, Rob Van Der Goot, Tommi Jauhiainen, Mourhaf Kazzaz, Nikola Ljubešić, Kai North, Barbara Plank, Yves Scherrer, Marcos Zampieri. Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023). 2023.

10.18653/v1/2023.vardial-1.25 article EN cc-by 2023-01-01

In recent years, the spread of online offensive content has become a great concern, motivating researchers to develop robust systems capable of identifying such content automatically. To carry out a fair evaluation of these systems, several international shared tasks have been organized, providing the community with essential benchmark data and methods for various languages. Organized since 2019, the HASOC (Hate Speech and Offensive Content Identification) task is one of these initiatives. In its fourth iteration, HASOC 2022 included...

10.1145/3574318.3574326 article EN 2022-12-09

Recent developments in the use of large-language models have led to substantial improvements in the accuracy of content-based automated scoring of free-text responses. The reported levels suggest that the systems could have widespread applicability in assessment. However, before they are used in operational testing, other aspects of their performance warrant examination. In this study, we explore the potential for examinees to inflate their scores by gaming the ACTA system. We explore a range of strategies including responding with words...

10.1111/jedm.12427 article EN Journal of Educational Measurement 2025-02-20

Even in highly-developed countries, as many as 15-30% of the population can only understand texts written using a basic vocabulary. Their understanding of everyday texts is limited, which prevents them from taking an active role in society and making informed decisions regarding healthcare, legal representation, or democratic choice. Lexical simplification is a natural language processing task that aims to make text understandable to everyone by replacing complex vocabulary and expressions with simpler ones, while...

10.3389/frai.2022.991242 article EN cc-by Frontiers in Artificial Intelligence 2022-09-22

Horacio Saggion, Sanja Štajner, Daniel Ferrés, Kim Cheng Sheang, Matthew Shardlow, Kai North, Marcos Zampieri. Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022). 2022.

10.18653/v1/2022.tsar-1.31 article EN cc-by 2022-01-01

Lexical Simplification (LS) is the task of substituting complex words within a sentence for simpler alternatives while maintaining the sentence's original meaning. LS is the lexical component of Text Simplification (TS) systems with the aim of improving accessibility to various target populations such as individuals with low literacy or reading disabilities. Prior surveys have been published several years before the introduction of transformers, transformer-based large language models (LLMs), and prompt learning that have drastically...

10.1007/s10844-024-00882-9 article EN cc-by Journal of Intelligent Information Systems 2024-09-02

Language identification is an important first step in many IR and NLP applications. Most publicly available language identification datasets, however, are compiled under the assumption that the gold label of each instance is determined by where the texts are retrieved from. Research has shown that this is a problematic assumption, particularly in the case of very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where texts may contain no distinctive marker of the particular language or variety. To overcome this limitation,...

10.48550/arxiv.2303.01490 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

Marcos Zampieri, Skye Morgan, Kai North, Tharindu Ranasinghe, Austin Simmons, Paridhi Khandelwal, Sara Rosenthal, Preslav Nakov. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023.

10.18653/v1/2023.acl-short.66 article EN cc-by 2023-01-01

We discover sizable differences between the lexical complexity assignments of first language (L1) and second language (L2) English speakers. The complexity assignments of 940 shared tokens without context were extracted and compared from three lexical complexity prediction (LCP) datasets: the CompLex dataset, the Word Complexity Lexicon, and the CEFR-J wordlist. It was found that word frequency, length, syllable count, familiarity, and prevalence, as well as a number of derivations, had a greater effect on perceived lexical complexity for L2 English speakers than they did for L1 English speakers. We explain these findings in...

10.3389/frai.2023.1236963 article EN cc-by Frontiers in Artificial Intelligence 2023-11-30

This paper describes team LCP-RIT's submission to the SemEval-2021 Task 1: Lexical Complexity Prediction (LCP). The task organizers provided participants with an augmented version of CompLex (Shardlow et al., 2020), an English multi-domain dataset in which words in context were annotated with respect to their complexity using a five point Likert scale. Our system uses logistic regression and a wide range of linguistic features (e.g. psycholinguistic features, n-grams, word frequency, POS tags) to predict the complexity of single...

10.18653/v1/2021.semeval-1.67 article EN cc-by Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021) 2021-01-01

Lexical simplification (LS) is the task of automatically replacing complex words with easier ones, making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605...

10.48550/arxiv.2209.09034 preprint EN cc-by-nc-sa arXiv (Cornell University) 2022-01-01

This paper describes team GMU-WLV's submission to the TSAR shared-task on multilingual lexical simplification. The goal of the task is to automatically provide a set of candidate substitutions for complex words in context. The organizers provided participants with ALEXSIS, a manually annotated dataset with instances split between a small trial set with a dozen instances in each of the three languages of the competition (English, Portuguese, Spanish) and a test set with over 300 instances in the three aforementioned languages. To cope with the lack of training data, participants had to either use alternative data...

10.18653/v1/2022.tsar-1.30 article EN cc-by 2022-01-01

Identifying complex words in texts is an important first step in text simplification (TS) systems. In this paper, we investigate the performance of binary comparative Lexical Complexity Prediction (LCP) models applied to a popular benchmark dataset: the CompLex 2.0 dataset used in SemEval-2021 Task 1. With the data from CompLex 2.0, we create a new dataset containing 1,940 sentences, referred to as CompLex-BC. Using CompLex-BC, we train multiple binary models to differentiate which of two target words is more or less complex in the same sentence. A linear SVM model achieved the best performance in our...

10.18653/v1/2022.bea-1.24 article EN cc-by 2022-01-01

Lexical Simplification (LS) is the task of replacing complex words with simpler words in a sentence whilst preserving the sentence's original meaning. LS is the lexical component of Text Simplification (TS) with the aim of making texts more accessible to various target populations. A past survey (Paetzold and Specia, 2017) has provided a detailed overview of LS. Since this survey, however, the AI/NLP community has been taken by storm by recent advances in deep learning, particularly with the introduction of large language models (LLM) and prompt learning. The high...

10.48550/arxiv.2305.12000 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

Lexical simplification (LS) automatically replaces words that are deemed difficult to understand for a given target population with simpler alternatives, whilst preserving the meaning of the original sentence. The TSAR-2022 shared task on LS provided participants with a multilingual lexical simplification test set. It contained nearly 1,200 complex words in English, Portuguese, and Spanish and presented multiple candidate substitutions for each complex word. The competition did not make training data available; therefore, teams had to use either...

10.18653/v1/2023.bea-1.33 article EN cc-by 2023-01-01

Objective: The reading level of health educational materials significantly influences information understandability and accessibility, particularly for minoritized populations. Many patient resources surpass the complexity of widely accepted standards. There is a critical need for high-performing text simplification models in health information to enhance dissemination and literacy. This need is particularly acute in cancer education, where effective prevention and screening education can substantially reduce morbidity and mortality. Methods: We...

10.48550/arxiv.2401.15043 preprint EN arXiv (Cornell University) 2024-01-26

Lexical Simplification (LS) automatically replaces difficult-to-read words with easier alternatives while preserving a sentence's original meaning. LS is a precursor to Text Simplification with the aim of improving text accessibility to various target demographics, including children, second language learners, and individuals with reading disabilities or low literacy. Several datasets exist for LS. These datasets specialize in one or two sub-tasks within the LS pipeline. However, as of this moment, no single LS dataset has been developed that covers...

10.48550/arxiv.2402.14972 preprint EN arXiv (Cornell University) 2024-02-22

The spread of offensive content online has become a reason for great concern in recent years, motivating researchers to develop robust systems capable of identifying such content automatically. With the goal of carrying out a fair evaluation of these systems, several international competitions have been organized, providing the community with important benchmark data and evaluation methods for various languages. Organized since 2019, the HASOC (Hate Speech and Offensive Content Identification) shared task is one of these initiatives. In its...

10.48550/arxiv.2211.10163 preprint EN cc-by arXiv (Cornell University) 2022-01-01

We report findings of the TSAR-2022 shared task on multilingual lexical simplification, organized as part of the Workshop on Text Simplification, Accessibility, and Readability held in conjunction with EMNLP 2022. The task called on the Natural Language Processing research community to contribute methods to advance the state of the art in multilingual lexical simplification for English, Portuguese, and Spanish. A total of 14 teams submitted the results of their systems for the provided test data. Results indicate new benchmarks in Lexical Simplification, with English quantitative...

10.48550/arxiv.2302.02888 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR), Discriminating Between Similar Languages -- True Labels (DSL-TL), and Discriminating Between Similar Languages -- Speech (DSL-S). All three tasks were organized for the first time this year.

10.48550/arxiv.2305.20080 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01