- Topic Modeling
- Natural Language Processing Techniques
- Semantic Web and Ontologies
- Sentiment Analysis and Opinion Mining
- Advanced Text Analysis Techniques
- Data Quality and Management
- Speech Recognition and Synthesis
- Speech and Audio Processing
- Multimodal Machine Learning Applications
- Software Engineering Research
- Data Mining and Machine Learning Applications
- Educational Technology Systems
- Music and Audio Processing
- Biomedical Text Mining and Ontologies
- Text Readability and Simplification
- Machine Learning and Data Classification
- Web Data Mining and Analysis
- Service-Oriented Architecture and Web Services
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Text and Document Classification Technologies
- Expert Finding and Q&A Systems
- Data Stream Mining Techniques
- Public Health and Nutrition
- Recommender Systems and Techniques
University of Indonesia
2020-2023
Free University of Bozen-Bolzano
2014-2019
Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, Jey Han Lau, Sebastian Ruder. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, Sebastian Ruder. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023.
Recent knowledge extraction methods are moving towards ternary and higher-arity relations to capture more information about binary facts. An example is to include the time, the location, and the duration of a specific fact. These relations can be even more complex to extract in advanced domains such as news, where events typically come with different facets including reasons, consequences, purposes, involved parties, and related events. The main challenge consists in first finding the set of facets related to each fact, and second tagging those facets to the relevant category.
In its daily use, the Indonesian language is riddled with informality, that is, deviations from the standard in terms of vocabulary, spelling, and word order. On the other hand, currently available Indonesian NLP models are typically developed with the standard Indonesian in mind. In this work, we address style transfer from informal to formal Indonesian as a low-resource machine translation problem. We build a new dataset of parallel sentences of informal Indonesian and its formal counterpart. We benchmark several strategies to perform style transfer from informal to formal Indonesian. We also explore augmenting the training set with artificial...
News websites give their users the opportunity to participate in discussions about published articles by writing comments. Typically, these comments are unstructured, making it hard to understand the flow of user discussions. Thus, there is a need for organizing comments to help users (1) gain more insights about news topics, and (2) have easy access to comments that trigger their interests. In this work, we address the above problem by organizing comments around the entities and the aspects they discuss. More specifically, we propose an approach for entity and aspect extraction from...
Alham Fikri Aji, Made Nindyatama Nityasya, Haryo Akbarianto Wibowo, Radityo Eko Prasojo, Tirana Fatyanosa. Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task. 2021.
Haryo Akbarianto Wibowo, Made Nindyatama Nityasya, Afra Feyza Akyürek, Suci Fitriany, Alham Fikri Aji, Radityo Eko Prasojo, Derry Tanti Wijaya. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
Recent advances in Natural Language Processing (NLP) have largely pushed deep transformer-based models as the go-to state-of-the-art technique without much regard to the production and utilization cost. Companies planning to adopt these methods into their business face difficulties because of the lack of machine, data, and human resources to build them. We compare both the performance and the cost of classical learning algorithms with the latest ones in common sequence and text labeling tasks. In our industrial datasets, we find that often...
Several solutions for lightweight TTS have shown promising results. Still, they either rely on a hand-crafted design that reaches non-optimum size or use a neural architecture search but often suffer training costs. We present Nix-TTS, a lightweight TTS achieved via knowledge distillation to a high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model. Specifically, we offer module-wise distillation, enabling flexible and independent distillation to the encoder and decoder module. The resulting Nix-...
The parallelism of Transformer-based models comes at the cost of their input max-length. Some studies proposed methods to overcome this limitation, but none of them reported the effectiveness of summarization as an alternative. In this study, we investigate the performance of document truncation and summarization in text classification tasks. Each of the two was investigated with several variations. This study also investigated how close their performances are to the performance of full-text. We used a dataset of summarization tasks based on Indonesian news articles (IndoSum) to do the classification tests. This study shows...
We perform a knowledge distillation (KD) benchmark from task-specific BERT-base teacher models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small. Our experiment involves 12 datasets grouped in two tasks: text classification and sequence labeling in the Indonesian language. We also compare various aspects of distillations including the usage of word embeddings and unlabeled data augmentation. Our experiments show that, despite the rising popularity of Transformer-based models, using BiLSTM and CNN provide the best...
Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Despite being the second most linguistically diverse country, most languages in Indonesia are categorized as endangered and some are even extinct. We develop...
We release our synthetic parallel paraphrase corpus across 17 languages: Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi, Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and Chinese. Our method relies only on monolingual data and a neural machine translation system to generate paraphrases, hence it is simple to apply. We generate multiple samples using beam search and choose the most lexically diverse pair according to their sentence BLEU. We compare our generated corpus with...
Made Nindyatama Nityasya, Haryo Wibowo, Alham Fikri Aji, Genta Winata, Radityo Eko Prasojo, Phil Blunsom, Adhiguna Kuncoro. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.
We develop and benchmark a Singlish pretrained neural language model. To this end, we build a novel 3 GB Singlish freetext dataset collected through various Singaporean websites. Then, we leverage ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) to train a transformer-based Singlish language model. ELECTRA is chosen due to its resource-efficiency, to better ensure reproducibility. We further build two text classification datasets in Singlish: sentiment analysis and language identification. We use the datasets to fine-tune our model and benchmark the results...