- Natural Language Processing Techniques
- Language and cultural evolution
- Linguistic Variation and Morphology
- Linguistics and Cultural Studies
- China's Ethnic Minorities and Relations
- Linguistics, Language Diversity, and Identity
- Phonetics and Phonology Research
- Topic Modeling
- Linguistics and language evolution
- Lexicography and Language Studies
- Semantic Web and Ontologies
- Digital Humanities and Scholarship
- Authorship Attribution and Profiling
- Linguistics and terminology studies
- Spanish Linguistics and Language Studies
- Historical Linguistics and Language Studies
- Language, Linguistics, Cultural Analysis
- Speech Recognition and Synthesis
- Multilingual Education and Policy
- Syntax, Semantics, Linguistic Variation
- Chinese history and philosophy
- Computational and Text Analysis Methods
- Hearing Impairment and Communication
- Scientific Computing and Data Management
- Neurobiology of Language and Bilingualism
Max Planck Institute for Evolutionary Anthropology
2021-2025
University of Passau
2023-2025
Aristotle University of Thessaloniki
2023
Trinity College Dublin
2023
Kobe City University of Foreign Studies
2023
Macquarie University
2022-2023
The University of Texas at Austin
2022-2023
University of Hawaiʻi at Mānoa
2022-2023
University of Colorado System
2022-2023
University of Colorado Boulder
2022-2023
Many human languages have words for emotions such as "anger" and "fear," yet it is not clear whether these emotions have similar meanings across languages, or why their meanings might vary. We estimate emotion semantics across a sample of 2474 spoken languages using "colexification", a phenomenon in which languages name semantically related concepts with the same word. Analyses show significant variation in networks of emotion concept colexification, which is predicted by the geographic proximity of language families. We also find evidence of universal structure in emotion colexification...
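The notion of colexification used in this abstract can be illustrated with a minimal sketch. It counts, for each pair of concepts, how many languages express both with the same word form; these counts are the edge weights of a colexification network. The word lists, language names, and word forms below are invented for illustration and are not taken from the study, and the function name `colexification_counts` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data (not from the study): language -> {concept: word form}
WORDLISTS = {
    "lang_a": {"ANGER": "kara", "ENVY": "kara", "FEAR": "timo"},
    "lang_b": {"ANGER": "suli", "ENVY": "suli", "FEAR": "suli"},
    "lang_c": {"ANGER": "ota", "ENVY": "nef", "FEAR": "nef"},
}


def colexification_counts(wordlists):
    """Count, per concept pair, the number of languages using one form for both."""
    counts = defaultdict(int)
    for concepts in wordlists.values():
        # Group the concepts of one language by their word form.
        by_form = defaultdict(set)
        for concept, form in concepts.items():
            by_form[form].add(concept)
        # Every pair of concepts sharing a form is one colexification.
        for group in by_form.values():
            for pair in combinations(sorted(group), 2):
                counts[pair] += 1
    return dict(counts)


counts = colexification_counts(WORDLISTS)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Treating each concept as a node and each count as an edge weight yields a small weighted network; comparing such networks across language families is the kind of analysis the abstract describes, although the study's actual pipeline is not specified here.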
Significance Given its size and geographical extension, Sino-Tibetan is of the highest importance for understanding the prehistory of East Asia, and of neighboring language families. Based on a dataset of 50 Sino-Tibetan languages, we infer phylogenies that date the origin of the language family to around 7200 B.P., linking the origin of the language family with the late Cishan and the early Yangshao cultures.
Humans have been using language for millennia but have only just begun to scratch the surface of what natural language can reveal about the mind. Here we propose that language offers a unique window into psychology. After briefly summarizing the legacy of language analyses in psychological science, we show how methodological advances have made these analyses more feasible and insightful than ever before. In particular, we describe how two forms of language analysis (natural-language processing and comparative linguistics) are contributing to how we understand topics as diverse...
This paper describes a computerized alternative to glottochronology for estimating elapsed time since parent languages diverged into daughter languages. The method, developed by the Automated Similarity Judgment Program (ASJP) consortium, is different from glottochronology in four major respects: (1) it is automated and thus more objective, (2) it applies a uniform analytical approach to a single database of worldwide languages, (3) it is based on lexical similarity as determined from Levenshtein (edit) distances rather than...
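The Levenshtein (edit) distance mentioned in this abstract can be sketched as follows. This is a generic textbook implementation, not the ASJP code; the normalization by the length of the longer word is an assumption added here so that word pairs of different lengths become comparable, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,              # delete x
                current[j - 1] + 1,           # insert y
                previous[j - 1] + (x != y),   # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))      # 3
print(normalized_distance("hand", "hant"))   # 0.25
```

Averaging such distances over the word pairs of two languages gives one crude measure of lexical similarity; how the ASJP method turns similarity into a time estimate is not shown in the truncated abstract and is not attempted here.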
The amount of data from languages spoken all over the world is rapidly increasing. Traditional manual methods in historical linguistics need to face the challenges brought by this influx of data. Automatic approaches to word comparison could provide invaluable help to pre-analyze data, which can be later enhanced by experts. In this way, computational approaches can take care of the repetitive and schematic tasks, leaving experts to concentrate on answering interesting questions. Here we test the potential of automatic methods to detect etymologically related...
Abstract Advances in computer-assisted linguistic research have been greatly influential in reshaping linguistic research. With the increasing availability of interconnected datasets created and curated by researchers, more and more interwoven questions can now be investigated. Such advances, however, are bringing high requirements in terms of rigorousness for preparing and curating datasets. Here we present CLICS, a Database of Cross-Linguistic Colexifications (CLICS). CLICS tackles interdisciplinary and interwoven research questions about the colexification...
Abstract The amount of available digital data for the languages of the world is constantly increasing. Unfortunately, most of the digital data are provided in a large variety of formats and therefore not amenable for comparison and re-use. The Cross-Linguistic Data Formats initiative proposes new standards for two basic types of data in historical and typological language comparison (word lists, structural datasets) and a framework to incorporate more data types (e.g. parallel texts, and dictionaries). The new specification for cross-linguistic data formats comes along with a software package for validation...
The past decades have seen substantial growth in digital data on the world's languages. At the same time, the demand for cross-linguistic datasets has been increasing, as witnessed by numerous studies devoted to diverse questions on human prehistory, cultural evolution, and human cognition. Unfortunately, most published datasets lack standardization, which makes their comparison difficult. Here, we present a new approach to increase the comparability of cross-linguistic lexical data. We designed workflows for the computer-assisted lifting...
In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation workflows. Here, we present neural network models that use lexical data from a worldwide sample of more than 1,000 languages with known affiliations to classify languages into...
Language evolution is traditionally described in terms of family trees with ancestral languages splitting into descendent languages. However, it has long been recognized that language evolution also entails horizontal components, most commonly through lexical borrowing. For example, the English language was heavily influenced by Old Norse and French; eight per cent of its basic vocabulary is borrowed. Borrowing is a distinctly non-tree-like process, akin to horizontal gene transfer in genome evolution, that cannot be recovered...
Abstract The Database of Cross-Linguistic Colexifications (CLICS) has established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. In its current form, it has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change, patterns of conceptualization, and linguistic paleontology. But CLICS has also been criticized for obvious shortcomings, ranging from the underlying dataset, which still contains many errors, up...
The Uto-Aztecan language family is one of the largest language families in the Americas. However, there has been considerable debate about its origin and how it spread. Here we use Bayesian phylogenetic methods to analyze lexical data from thirty-four Uto-Aztecan varieties and two Kiowa-Tanoan languages. We infer the age of Proto-Uto-Aztecan to be around 4,100 years (3,258–5,025 years) and identify the most likely homeland to be near what is now Southern California. We reconstruct the most probable subsistence strategy of the ancestral Uto-Aztecan society, with no casual or...
Abstract Contrary to what non-practitioners might expect, the systems of phonetic notation used by linguists are highly idiosyncratic. Not only do various linguistic subfields disagree on the specific symbols they use to denote the speech sounds of languages, but also in large databases of sound inventories considerable variation can be found. Inspired by recent efforts to link cross-linguistic data with the help of reference catalogues (Glottolog, Concepticon) across different resources, we present initial efforts to link different phonetic notation systems to a catalogue...
Lexical borrowing, the transfer of words from one language to another, is one of the most frequent processes in language evolution. In order to detect borrowings, linguists make use of various strategies, combining evidence from various sources. Despite the increasing popularity of computational approaches in comparative linguistics, automated approaches to lexical borrowing detection are still in their infancy, disregarding many aspects of the evidence that is routinely considered by human experts. One example for this kind of evidence are phonological and phonotactic clues that are especially...
The paper presents the Etymological DICtionary ediTOR (EDICTOR), a free, interactive, web-based tool designed to aid historical linguists in creating, editing, analysing, and publishing etymological datasets. The EDICTOR offers interactive solutions for important tasks in historical linguistics, including facilitated input and segmentation of phonetic transcriptions, quantitative and qualitative analyses of phonetic and morphological data, enhanced interfaces for cognate class assignment and multiple word alignment, and automated evaluation...
Improved computational models of sound change shed light on the history of the Tukanoan languages. There has been much debate regarding the internal classification of the Tukanoan language family during the last four decades, with different classification proposals being based on lexical and phonological data. Here, we present a new classification of the language family based on an improved computational approach which infers phylogenetic trees from proposed sound change patterns. In contrast to traditional methods based on manual identification of shared innovations by experts, our method identifies valid innovations within a parsimony...
Taraka Rama, Johann-Mattis List, Johannes Wahle, Gerhard Jäger. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.
Most current approaches in phylogenetic linguistics require as input multilingual word lists partitioned into sets of etymologically related words (cognates). Cognate identification is so far done manually by experts, which is time consuming and as yet only available for a small number of well-studied language families. Automatizing this step will greatly expand the empirical scope of phylogenetic methods in linguistics, as raw wordlists (in phonetic transcription) are much easier to obtain than wordlists in which cognate words have been fully...
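The task described here, partitioning the words for one concept into cognate sets, can be sketched with a deliberately simple baseline. This is not the method of the paper: it swaps in the string similarity ratio from Python's standard `difflib` module in place of any phonetically informed score, and links words whose similarity reaches a threshold (single-linkage clustering via union-find). The threshold of 0.6, the function name `cluster_cognates`, and the example word forms are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Surface string similarity in [0, 1]; a stand-in for a phonetic score."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_cognates(words, threshold=0.6):
    """Partition words into candidate cognate sets by single-linkage clustering."""
    parent = list(range(len(words)))  # union-find forest over word indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair of words that is similar enough.
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if similarity(words[i], words[j]) >= threshold:
                parent[find(i)] = find(j)

    # Collect the words of each connected component.
    clusters = {}
    for i, word in enumerate(words):
        clusters.setdefault(find(i), []).append(word)
    return sorted(clusters.values())


# Illustrative forms for the concept "hand" in ordinary spelling.
clusters = cluster_cognates(["hand", "hant", "hond", "mano", "main", "ruka"])
print(clusters)
```

A surface-similarity baseline of this kind confuses chance resemblance with shared ancestry and misses cognates that have diverged through regular sound change, which is one reason cognate judgments have so far been left to experts.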
Abstract The use of computational methods in comparative linguistics is growing in popularity. The increasing deployment of such methods draws into focus those areas in which they remain inadequate as well as those areas where classical approaches to language comparison are untransparent and inconsistent. In this paper we illustrate specific challenges which both computational and classical approaches encounter when studying South-East Asian languages. With the help of data from the Burmish language family we point to the challenges resulting from missing annotation standards and insufficient methods for analysis and discuss how to tackle...