- Natural Language Processing Techniques
- Topic Modeling
- Artificial Intelligence in Games
- Speech and dialogue systems
- Authorship Attribution and Profiling
- Machine Learning in Bioinformatics
- Text and Document Classification Technologies
- Second Language Acquisition and Learning
- Educational Technology Systems
- Multimodal Machine Learning Applications
- Lexicography and Language Studies
- Linguistic Variation and Morphology
- Handwritten Text Recognition Techniques
- Advanced Text Analysis Techniques
- Linguistics and Language Analysis
- Syntax, Semantics, Linguistic Variation
Waseda University
2018-2024
University of Indonesia
2013-2014
We describe our work on designing a linguistically principled part-of-speech (POS) tagset for the Indonesian language. The process involves a detailed study and analysis of existing tagsets and the manual tagging of an Indonesian corpus. The results of this work are an Indonesian POS tagset consisting of 23 tags and an Indonesian corpus of over 250,000 lexical tokens that have been manually tagged using this tagset.
This paper describes work on a part-of-speech tagger for the Indonesian language by employing a rule-based approach. The system tokenizes documents while also considering multi-word expressions and recognizes named entities. It then applies tags to every token, starting from closed-class words and moving to open-class words, and disambiguates the tags based on a set of manually defined rules. The system currently obtains an accuracy of 79% on a manually tagged corpus of roughly 250,000 tokens.
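The two-pass idea in this abstract (assign candidate tags, closed-class words first, then disambiguate with hand-written rules) can be sketched as follows. This is a minimal illustration, not the authors' system: the lexicons, the tag names, and the single disambiguation rule are all hypothetical, and multi-word expressions and named entities are not handled.

```python
# Hypothetical lexicons and tags, for illustration only.
CLOSED_CLASS = {"saya": "PRP", "di": "IN", "dan": "CC", "itu": "PR"}
OPEN_CLASS = {"makan": ["VB"], "buku": ["NN"], "bisa": ["MD", "NN"]}


def candidate_tags(token):
    """Look up closed-class words first, then open-class words."""
    word = token.lower()
    if word in CLOSED_CLASS:
        return [CLOSED_CLASS[word]]
    # Unknown words fall back to a noun guess (an assumption of this sketch).
    return list(OPEN_CLASS.get(word, ["NN"]))


def disambiguate(candidates, i):
    """Pick one tag for position i using a manually defined rule."""
    options = candidates[i]
    if len(options) == 1:
        return options[0]
    following = candidates[i + 1] if i + 1 < len(candidates) else []
    # Hypothetical rule: an MD/NN-ambiguous word directly before a verb is a modal.
    if "MD" in options and "VB" in following:
        return "MD"
    # Otherwise take the first non-modal reading.
    return next(t for t in options if t != "MD")


def tag(tokens):
    candidates = [candidate_tags(t) for t in tokens]
    return [(t, disambiguate(candidates, i)) for i, t in enumerate(tokens)]
```

For example, `tag(["Saya", "bisa", "makan"])` resolves the ambiguous word "bisa" to the modal reading because a verb follows it, while `tag(["bisa", "itu"])` falls back to the noun reading.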
Morphological segmentation is useful for processing Mongolian. In this paper, we manually build a morphological segmentation data set for Mongolian. We then present a character-based encoder-decoder model with an attention mechanism to perform the task. We further investigate the influence of analogy features extracted from scratch and improve the performance of our model using a multi-language setting. Experimental results show that our model provides a strong baseline for Mongolian morphological segmentation. The analogy features provide information to the system. The use of a multi-language setting shows the capability to acquire...
This paper presents the system submitted by IPS-WASEDA University for the CoNLL-SIGMORPHON 2018 Shared Task 1: Type-level inflection. We develop a system based on a holistic approach which considers the whole word form as a unit, instead of breaking it into smaller pieces (e.g., morphemes) as the baseline system does. We also implement an encoder-decoder model, which has recently become the new standard in many natural language processing (NLP) tasks. The results show that the neural model outperforms the baseline and our holistic approach on bigger resources...
Morphological generation is a task where, given a lemma and a morphosyntactic description of the target form, we are asked to generate the target form. Knowing that syntactic and semantic relations to other forms are reflected by the word form itself, we show how to exploit these relations between forms holistically, that is, as a whole, to derive the target form without even breaking them into morphemes. Experimental results show that organising lexica into analogical grids is able to improve the accuracy of morphological generation by up to 8% in low data scenarios. Our holistic approach always performs...
In this paper, we inspect the theoretical problem of counting the number of analogies between sentences contained in a text. Based on this, we measure the analogical density of the text. We focus on analogy at the sentence level, based on the level of form rather than on the level of semantics. Experiments are carried out on two different corpora in six European languages known to have various levels of morphological richness. Corpora are tokenised using several tokenisation schemes: character, sub-word and word. For the sub-word scheme, we employ popular sub-word models: unigram language...
This paper describes work on a poetry generator that is capable of generating poems in Indonesian based on certain contexts by employing a constraint satisfaction approach. The system retrieves language resources such as templates and slot fillers and combines them to instantiate lines, which in turn are composed into a poem based on a set of given constraints. The output of this system was evaluated through an online questionnaire involving 180 respondents. The results showed that poems generated using the full set of constraints were consistently measured...
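The template-and-filler pipeline described in this abstract can be illustrated with a small generate-and-test sketch: instantiate every line a template allows, then keep only the lines that satisfy all constraints. The template, the filler words, and the rhyme constraint below are hypothetical, and exhaustive filtering is a deliberately simple stand-in for whatever constraint solver the actual system uses.

```python
from itertools import product

# Hypothetical language resources, for illustration only.
TEMPLATE = "{noun} yang {adj}"
FILLERS = {
    "noun": ["malam", "hujan", "laut"],
    "adj": ["sunyi", "dingin", "biru"],
}


def instantiate(template, fillers):
    """Yield every line obtained by filling each slot with each filler."""
    slots = sorted(fillers)
    for combo in product(*(fillers[s] for s in slots)):
        yield template.format(**dict(zip(slots, combo)))


def ends_with(suffix):
    """Build a simple rhyme constraint: the line must end with `suffix`."""
    return lambda line: line.endswith(suffix)


def compose(template, fillers, constraints, n_lines):
    """Keep the first n_lines candidate lines that satisfy every constraint."""
    valid = [
        line
        for line in instantiate(template, fillers)
        if all(check(line) for check in constraints)
    ]
    return valid[:n_lines]
```

Calling `compose(TEMPLATE, FILLERS, [ends_with("i")], 2)` returns two lines ending in "i". Adding more constraint functions to the list (for instance, a word-count limit) narrows the candidate set in the same way.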
Indonesian as an agglutinating language is known for its derivative morphological richness. Word forms are constructed by combining a stem and affixes. In this paper, we study the influence of surface form information in analogical grids extracted from a set of word forms with varying sizes. Each word form is represented as a feature vector. In this experiment setting, we consider three features: characters, affixes, and morphosyntactic definition. The sizes and saturation of the grids are then observed to characterize the grids.
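To make the feature-vector representation concrete, the sketch below maps a word form to character counts and compares the vector differences of two word pairs. Equal differences are a necessary (not sufficient) condition for a formal analogy A : B :: C : D, which is what places four forms in the same cells of an analogical grid. The function names and the restriction to the lowercase Latin alphabet are assumptions of this example; the affix and morphosyntactic features mentioned in the abstract are omitted.

```python
from collections import Counter
import string


def char_vector(word):
    """Represent a word form as a vector of character counts (a to z)."""
    counts = Counter(word.lower())
    return [counts[c] for c in string.ascii_lowercase]


def difference(a, b):
    """Component-wise difference between the vectors of two word forms."""
    return [y - x for x, y in zip(char_vector(a), char_vector(b))]


def may_be_analogy(a, b, c, d):
    """Necessary condition for a : b :: c : d on character counts."""
    return difference(a, b) == difference(c, d)
```

For instance, `may_be_analogy("makan", "makanan", "minum", "minuman")` holds because both pairs add one "a" and one "n" (the suffix -an), whereas `may_be_analogy("makan", "makanan", "minum", "diminum")` does not.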