NFDI4DS | UHH-SEMS - Publication Details

A Unified Neural Coherence Model

OPENALEX - Publications

Han Cheol Moon, Tasnim Mohiuddin, Shafiq Joty, Chi Xu

Han Cheol Moon, Tasnim Mohiuddin, Shafiq Joty, Chi Xu. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

10.18653/v1/d19-1231 article EN cc-by 2019-01-01

Revisiting Adversarial Autoencoder for Unsupervised Word Translation with Cycle Consistency and Improved Training

OPENALEX - Publications

Tasnim Mohiuddin, Shafiq Joty

Tasnim Mohiuddin, Shafiq Joty. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

10.18653/v1/n19-1386 article EN 2019-01-01

LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space

OPENALEX - Publications

Tasnim Mohiuddin, M Saiful Bari, Shafiq Joty

Most of the successful and predominant methods for Bilingual Lexicon Induction (BLI) are mapping-based, where a linear mapping function is learned with the assumption that the word embedding spaces of different languages exhibit similar geometric structures (i.e., approximately isomorphic). However, several recent studies have criticized this simplified assumption, showing that it does not hold in general even for closely related languages. In this work, we propose a novel semi-supervised method to learn cross-lingual embeddings...

10.18653/v1/2020.emnlp-main.215 article EN cc-by 2020-01-01
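
The mapping-based setup that the abstract above takes as its starting point can be illustrated with a small sketch. The toy 2-D vectors, the matrix `W`, and the helper names are hypothetical and not taken from the paper; the code only shows how a fixed linear map plus nearest-neighbour search induces a lexicon when the two spaces really are isomorphic, which is the assumption LNMap departs from.

```python
import math

def matvec(W, x):
    """Apply a linear map W (given as a list of rows) to vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def induce_lexicon(src_emb, tgt_emb, W):
    """Map each source vector with W and pick the closest target word."""
    lexicon = {}
    for word, vec in src_emb.items():
        mapped = matvec(W, vec)
        lexicon[word] = max(tgt_emb, key=lambda t: cosine(mapped, tgt_emb[t]))
    return lexicon

# Hypothetical toy data: the target space is the source space rotated by 90 degrees,
# so a single rotation matrix aligns the two spaces exactly.
src_emb = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}
tgt_emb = {"Hund": [0.0, 1.0], "Katze": [-1.0, 0.0]}
W = [[0.0, -1.0],
     [1.0, 0.0]]

lexicon = induce_lexicon(src_emb, tgt_emb, W)
print(lexicon)
```

When the two spaces are not related by any single linear transformation, no choice of `W` recovers every pair, which is the situation a non-linear mapping is meant to address.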

Fanar: An Arabic-Centric Multimodal Generative AI Platform

OPENALEX - Publications

Fanar Team, Ummar Abbas, Mohammad Shahmeer Ahmad, Firoj Alam, Enes Altınışık, and 37 more

We present Fanar, a platform for Arabic-centric multimodal generative AI systems that supports language, speech and image generation tasks. At the heart of Fanar are Fanar Star and Fanar Prime, two highly capable Arabic Large Language Models (LLMs) that are best in class on well-established benchmarks for similar-sized models. Fanar Star is a 7B (billion) parameter model that was trained from scratch on nearly 1 trillion clean and deduplicated Arabic, English and Code tokens. Fanar Prime is a 9B parameter model continually trained on the Gemma-2 base model on the same token set. Both models...

10.48550/arxiv.2501.13944 preprint EN arXiv (Cornell University) 2025-01-18

Unsupervised Word Translation with Adversarial Autoencoder

OPENALEX - Publications

Tasnim Mohiuddin, Shafiq Joty

Crosslingual word embeddings learned from monolingual embeddings have a crucial role in many downstream tasks, ranging from machine translation to transfer learning. Adversarial training has shown impressive success in learning crosslingual embeddings and the associated word translation task without any parallel data by mapping monolingual embeddings to a shared space. However, recent work has shown superior performance for non-adversarial methods in more challenging language pairs. In this article, we investigate adversarial autoencoder for unsupervised word translation and propose two novel...

10.1162/coli_a_00374 article EN cc-by-nc-nd Computational Linguistics 2020-03-23

UXLA: A Robust Unsupervised Data Augmentation Framework for Zero-Resource Cross-Lingual NLP

OPENALEX - Publications

M Saiful Bari, Tasnim Mohiuddin, Shafiq Joty

M Saiful Bari, Tasnim Mohiuddin, Shafiq Joty. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.

10.18653/v1/2021.acl-long.154 article EN cc-by 2021-01-01

Data Selection Curriculum for Neural Machine Translation

OPENALEX - Publications

Tasnim Mohiuddin, Philipp Koehn, Vishrav Chaudhary, James H. Cross, Shruti Bhosale, and 1 more

Neural Machine Translation (NMT) models are typically trained on heterogeneous data that are concatenated and randomly shuffled. However, not all of the training data are equally useful to the model. Curriculum training aims to present the data to the NMT models in a meaningful order. In this work, we introduce a two-stage curriculum training framework for NMT where we fine-tune a base NMT model on subsets of data, selected by both deterministic scoring using pre-trained methods and online scoring that considers prediction scores of the emerging NMT model. Through comprehensive experiments on six language pairs...

10.18653/v1/2022.findings-emnlp.113 article EN cc-by 2022-01-01
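
The data-selection step described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scores are made-up numbers standing in for whatever deterministic or online scorer is used, and `select_top_fraction` is a hypothetical helper name.

```python
def select_top_fraction(pairs, scores, fraction):
    """Return the highest-scoring fraction of sentence pairs (at least one pair)."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(len(pairs) * fraction))
    # Rank indices by score, best first; sorted() is stable, so ties keep corpus order.
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [pairs[i] for i in ranked[:keep]]

# Hypothetical toy bitext with one quality score per sentence pair.
corpus = [
    ("guten Morgen", "good morning"),
    ("asdf", "click here"),
    ("danke", "thank you"),
    ("Haus", "house"),
]
scores = [0.9, 0.1, 0.8, 0.6]

# Stage 1 would train a base model on the full corpus;
# stage 2 would fine-tune it on a selected subset such as this one.
subset = select_top_fraction(corpus, scores, 0.5)
print(subset)
```

In an online variant, the `scores` list would be recomputed from the current model's predictions before each selection round rather than fixed in advance.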

Modeling Speech Acts in Asynchronous Conversations: A Neural-CRF Approach

OPENALEX - Publications

Shafiq Joty, Tasnim Mohiuddin

Participants in an asynchronous conversation (e.g., forum, e-mail) interact with each other at different times, performing certain communicative acts, called speech acts (e.g., question, request). In this article, we propose a hybrid approach to speech act recognition in asynchronous conversations. Our approach works in two main steps: a long short-term memory recurrent neural network (LSTM-RNN) first encodes each sentence separately into a task-specific distributed representation, and this is then used in a conditional random field (CRF) model...

10.1162/coli_a_00339 article EN cc-by-nc-nd Computational Linguistics 2018-09-18

AugVic: Exploiting BiText Vicinity for Low-Resource NMT

OPENALEX - Publications

Tasnim Mohiuddin, M Saiful Bari, Shafiq Joty

The success of Neural Machine Translation (NMT) largely depends on the availability of large bitext training corpora. Due to the lack of such corpora in low-resource language pairs, NMT systems often exhibit poor performance. Extra relevant monolingual data helps, but acquiring it could be quite expensive, especially for low-resource languages. Moreover, domain mismatch between bitext (train/test) and monolingual data might degrade performance. To alleviate such issues, we propose AUGVIC, a novel data augmentation framework which exploits vicinal...

10.18653/v1/2021.findings-acl.267 article EN cc-by 2021-01-01

Rethinking Coherence Modeling: Synthetic vs. Downstream Tasks

OPENALEX - Publications

Tasnim Mohiuddin, Prathyusha Jwalapuram, Xiang Lin, Shafiq Joty

Although coherence modeling has come a long way in developing novel models, their evaluation on downstream applications for which they are purportedly developed has largely been neglected. With the advancements made by neural approaches in applications such as machine translation (MT), summarization and dialog systems, the need for coherence evaluation of these tasks is now more crucial than ever. However, coherence models are typically evaluated only on synthetic tasks, which may not be representative of their performance in downstream applications. To investigate how representative the synthetic tasks are of downstream use cases, we...

10.18653/v1/2021.eacl-main.308 article EN cc-by 2021-01-01

UXLA: A Robust Unsupervised Data Augmentation Framework for Zero-Resource Cross-Lingual NLP

OPENALEX - Publications

M Saiful Bari, Tasnim Mohiuddin, Shafiq Joty

Transfer learning has yielded state-of-the-art (SoTA) results in many supervised NLP tasks. However, annotated data for every target task in every target language is rare, especially for low-resource languages. We propose UXLA, a novel unsupervised data augmentation framework for zero-resource transfer learning scenarios. In particular, UXLA aims to solve cross-lingual adaptation problems from a source language task distribution to an unknown target language task distribution, assuming no training label in the target language. At its core, UXLA performs simultaneous self-training...

10.48550/arxiv.2004.13240 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Adaptation of Hierarchical Structured Models for Speech Act Recognition in Asynchronous Conversation

OPENALEX - Publications

Tasnim Mohiuddin, Thanh-Tung Nguyen, Shafiq Joty

Tasnim Mohiuddin, Thanh-Tung Nguyen, Shafiq Joty. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

10.18653/v1/n19-1134 article EN 2019-01-01

LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space

OPENALEX - Publications

Tasnim Mohiuddin, M Saiful Bari, Shafiq Joty

Most of the successful and predominant methods for bilingual lexicon induction (BLI) are mapping-based, where a linear mapping function is learned with the assumption that the word embedding spaces of different languages exhibit similar geometric structures (i.e., approximately isomorphic). However, several recent studies have criticized this simplified assumption, showing that it does not hold in general even for closely related languages. In this work, we propose a novel semi-supervised method to learn cross-lingual embeddings...

10.48550/arxiv.2004.13889 preprint EN public-domain arXiv (Cornell University) 2020-01-01

A Unified Neural Coherence Model

OPENALEX - Publications

Han Cheol Moon, Tasnim Mohiuddin, Shafiq Joty, Chi Xu

Recently, neural approaches to coherence modeling have achieved state-of-the-art results in several evaluation tasks. However, we show that most of these models often fail on harder tasks with more realistic application scenarios. In particular, the existing models underperform on tasks that require the model to be sensitive to local contexts, such as candidate ranking in conversational dialogue and in machine translation. In this paper, we propose a unified coherence model that incorporates sentence grammar, inter-sentence coherence relations, and global coherence patterns into...

10.48550/arxiv.1909.00349 preprint EN cc-by arXiv (Cornell University) 2019-01-01

DM-Codec: Distilling Multimodal Representations for Speech Tokenization

OPENALEX - Publications

Md. Mubtasim Ahasan, Md Fahim, Tasnim Mohiuddin, A K M Mahbubur Rahman, Aman Chadha, and 4 more

Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved...

10.48550/arxiv.2410.15017 preprint EN arXiv (Cornell University) 2024-10-19

GenAI Content Detection Task 2: AI vs. Human -- Academic Essay Authenticity Challenge

OPENALEX - Publications

Shammur Absar Chowdhury, Hind Almerekhi, Mücahid Kutlu, Kaan Efe Keleş, Fatema Ahmad, and 3 more

This paper presents a comprehensive overview of the first edition of the Academic Essay Authenticity Challenge, organized as part of the GenAI Content Detection shared tasks collocated with COLING 2025. This challenge focuses on detecting machine-generated vs. human-authored essays for academic purposes. The task is defined as follows: "Given an essay, identify whether it is generated by a machine or authored by a human." The challenge involves two languages: English and Arabic. During the evaluation phase, 25 teams submitted systems for English and 21...

10.48550/arxiv.2412.18274 preprint EN arXiv (Cornell University) 2024-12-24

ImEW: A Framework for Editing Image in the Wild

OPENALEX - Publications

Tasnim Mohiuddin, Tianyi Zhang, Maowen Nie, Jing Huang, Qianqian Chen, and 1 more

The ability to edit images in a realistic and visually appealing manner is a fundamental requirement in various computer vision applications. In this paper, we present ImEW, a unified framework designed for solving image editing tasks. ImEW utilizes off-the-shelf foundation models to address four essential editing tasks: object removal, object translation, object replacement, and generative fill beyond the image frame. These tasks are accomplished by leveraging the capabilities of state-of-the-art foundation models, namely Segment Anything Model,...

10.1145/3607827.3616840 article EN 2023-10-26

Coherence Modeling of Asynchronous Conversations: A Neural Entity Grid Approach

OPENALEX - Publications

Tasnim Mohiuddin, Shafiq Joty, Dat Tien Nguyen

We propose a novel coherence model for written asynchronous conversations (e.g., forums, emails), and show its applications in coherence assessment and thread reconstruction tasks. We conduct our research in two steps. First, we propose improvements to the recently proposed neural entity grid model by lexicalizing its entity transitions. Then, we extend the model to asynchronous conversations by incorporating the underlying conversational structure in the entity grid representation and feature computation. Our model achieves state-of-the-art results on standard coherence assessment tasks in monologue and conversations, outperforming existing models. We also...

10.48550/arxiv.1805.02275 preprint EN other-oa arXiv (Cornell University) 2018-01-01