- Topic Modeling
- Natural Language Processing Techniques
- Software Engineering Research
- Machine Learning and Data Classification
- Speech Recognition and Synthesis
- Text Readability and Simplification
- Organizational Management and Leadership
- Adversarial Robustness in Machine Learning
- Scientific Computing and Data Management
- Multi-Agent Systems and Negotiation
- Mathematics Education and Pedagogy
- Bacillus and Francisella bacterial research
- Anomaly Detection Techniques and Applications
- Computational and Text Analysis Methods
- Speech and dialogue systems
- Multimodal Machine Learning Applications
- Parallel Computing and Optimization Techniques
- Auction Theory and Applications
- Imbalanced Data Classification Techniques
- Interpreting and Communication in Healthcare
- Ethics and Social Impacts of AI
- Advanced Malware Detection Techniques
- Data-Driven Disease Surveillance
- Fractal and DNA sequence analysis
- Machine Learning in Healthcare
Indian Institute of Information Technology Vadodara
2024
Chandigarh University
2024
Amazon (United States)
2019-2023
Indian Institute of Technology Jammu
2023
John Brown University
2023
Amazon (Germany)
2021
Tata Consultancy Services (India)
2014-2015
Luoxin Chen, Francisco Garcia, Varun Kumar, He Xie, Jianhua Lu. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. 2021.
Yang Cao, Yada Pruksachatkun, Kai-Wei Chang, Rahul Gupta, Varun Kumar, Jwala Dhamala, Aram Galstyan. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2022.
Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, Bing Xiang. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
Scientific machine learning (SciML) has advanced recently across many different areas in computational science and engineering. The objective is to integrate data and physics seamlessly, without the need of employing elaborate and computationally taxing data assimilation schemes. However, preprocessing, problem formulation, code generation, postprocessing, and analysis are still time-consuming and may prevent SciML from wide applicability in industrial applications and in digital twin frameworks. Here, we integrate the various stages...
Umang Gupta, Jwala Dhamala, Varun Kumar, Apurv Verma, Yada Pruksachatkun, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Greg Ver Steeg, Aram Galstyan. Findings of the Association for Computational Linguistics: ACL 2022.
We present new benchmarks for the evaluation of code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. Using these benchmarks, we are able to assess the performance of code generation models in a multi-lingual fashion, and discovered the generalization ability of language models on out-of-domain languages, the advantages of multi-lingual models over mono-lingual ones,...
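As a toy illustration of what such transpilation involves (the function below handles a single Python assertion pattern and targets JavaScript; it is an assumed example, far narrower than the paper's actual conversion framework):

```python
import re

def python_assert_to_javascript(line):
    """Illustrative sketch: convert a simple Python test case of the form
    `assert f(args) == expected` into a JavaScript equivalent. The real
    conversion framework handles full prompts, many patterns, and many
    target languages; this handles one pattern only.
    """
    m = re.match(r"assert\s+(.+?)\s*==\s*(.+)", line.strip())
    if m is None:
        raise ValueError(f"unsupported test case: {line!r}")
    call, expected = m.groups()
    # JSON.stringify gives structural equality for arrays/objects as well.
    return f"console.assert(JSON.stringify({call}) === JSON.stringify({expected}));"

print(python_assert_to_javascript("assert add(1, 2) == 3"))
# console.assert(JSON.stringify(add(1, 2)) === JSON.stringify(3));
```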
ML-powered code generation aims to assist developers to write code in a more productive manner by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have pushed the boundary of code generation and achieved impressive performance. However, the huge number of model parameters poses a significant challenge to their adoption in a typical software development environment, where a developer might use a standard laptop or mid-size server to develop code. Such a cost of resources, in terms...
Expanding new functionalities efficiently is an ongoing challenge for single-turn task-oriented dialogue systems. In this work, we explore functionality-specific semi-supervised learning via self-training. We consider methods that augment training data automatically from unlabeled data sets in a functionality-targeted manner. In addition, we examine multiple techniques for the efficient selection of augmented utterances, to reduce training time and increase diversity. First, we consider paraphrase detection methods that attempt to find utterance...
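As a rough illustration of one such self-training round (a sketch only: the scikit-learn pipeline, function name, and confidence-threshold selection rule below are assumptions, not the paper's method):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def self_training_round(labeled, unlabeled_texts, target_label, threshold=0.9):
    """One functionality-targeted self-training round (illustrative sketch):
    train on labeled (utterance, label) pairs, pseudo-label the unlabeled
    pool, and keep only confident predictions for the target functionality.
    """
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    texts, labels = zip(*labeled)
    model.fit(texts, labels)
    augmented = list(labeled)
    for text in unlabeled_texts:
        probs = model.predict_proba([text])[0]
        pred = model.classes_[probs.argmax()]
        if pred == target_label and probs.max() >= threshold:
            augmented.append((text, pred))  # confident pseudo-label
    return augmented
```

The threshold trades off pseudo-label noise against coverage; the paraphrase-detection and diversity-based selection techniques the abstract mentions would replace or refine the simple confidence filter used here.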
Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski, Xiaopeng Li, Murali Krishna Ramanathan, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track). 2023.
Ninareh Mehrabi, Palash Goyal, Apurv Verma, Jwala Dhamala, Varun Kumar, Qian Hu, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Rahul Gupta. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
Multiple metrics have been introduced to measure fairness in various natural language processing tasks. These metrics can be roughly categorized into two categories: 1) *extrinsic metrics* for evaluating fairness in downstream applications and 2) *intrinsic metrics* for estimating fairness in upstream contextualized language representation models. In this paper, we conduct an extensive correlation study between intrinsic and extrinsic metrics across bias notions using 19 contextualized language models. We find that intrinsic and extrinsic metrics do not necessarily correlate in their original setting, even when...
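Mechanically, such a correlation study boils down to correlating per-model scores of an intrinsic metric with those of an extrinsic one; the scores below are fabricated solely to show the computation:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-model scores for one intrinsic and one extrinsic metric,
# measured on the same set of models (numbers are made up for illustration).
intrinsic = [0.12, 0.30, 0.25, 0.40, 0.18]
extrinsic = [0.50, 0.45, 0.60, 0.41, 0.55]

r, p = pearsonr(intrinsic, extrinsic)
rho, p_s = spearmanr(intrinsic, extrinsic)
print(f"Pearson r={r:.2f} (p={p:.2f}); Spearman rho={rho:.2f} (p={p_s:.2f})")
```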
Natural language often contains ambiguities that can lead to misinterpretation and miscommunication. While humans can handle ambiguities effectively by asking clarifying questions and/or relying on contextual cues and common-sense knowledge, resolving ambiguities can be notoriously hard for machines. In this work, we study ambiguities that arise in text-to-image generative models. We curate a benchmark dataset covering different types of ambiguities that occur in these systems. We then propose a framework to mitigate ambiguities in the prompts given to these systems by soliciting clarifications...
The relevance of a concept being taught to the real world is believed to contribute to an increase in intrinsic motivation and engagement of the learner. Such relevance is often found lacking in learning material such as textbooks. Practical issues and problems one could face while applying or implementing new concepts are one means of establishing relevance. In this paper, we propose a method to automatically augment textbooks with practical questions about the concepts being learnt. We use answers from StackOverflow, a leading social Questions & Answers (Q&A) website...
The incorporation of cryptographic techniques is crucial for guaranteeing the privacy and security of data processed and transmitted inside IoT ecosystems, particularly as the IoT keeps growing. Examining problems including resource limitations, scalability, and the dynamic nature of IoT environments, this research paper explores the complex obstacles that cryptographic solutions confront in the IoT. Lightweight cryptography, post-quantum cryptography, and blockchain integration are some of the new trends and future prospects examined in the study, in an effort to...
In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient...
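The packing step can be illustrated with best-fit-decreasing bin packing over document token counts; this is a generic sketch of the underlying algorithm, not the paper's implementation, and it assumes over-length documents are chunked beforehand:

```python
def best_fit_packing(doc_lengths, max_len):
    """Pack documents (given as token counts) into training sequences of
    capacity `max_len` without splitting any document: each document goes
    into the fullest bin that can still hold it (best-fit decreasing).
    Illustrative sketch; documents longer than `max_len` are assumed to
    be chunked before packing.
    """
    bins = []  # each bin: [remaining_capacity, [doc_indices]]
    for i in sorted(range(len(doc_lengths)), key=lambda i: -doc_lengths[i]):
        need = doc_lengths[i]
        # Choose the bin with the least remaining space that still fits.
        best = None
        for b in bins:
            if b[0] >= need and (best is None or b[0] < best[0]):
                best = b
        if best is None:  # nothing fits: open a new training sequence
            best = [max_len, []]
            bins.append(best)
        best[0] -= need
        best[1].append(i)
    return [b[1] for b in bins]

# Pack six documents into 2048-token training sequences.
print(best_fit_packing([1500, 600, 2048, 100, 900, 400], max_len=2048))
# -> [[2], [0, 5, 3], [4, 1]]  (no document is truncated)
```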
A diverse array of reasoning strategies has been proposed to elicit the capabilities of large language models. However, in this paper, we point out that traditional evaluations which focus solely on performance metrics miss a key factor: the increased effectiveness due to additional compute. By overlooking this aspect, a skewed view of strategy efficiency is often presented. This paper introduces a framework that incorporates the compute budget into the evaluation, providing a more informative comparison that takes into account both performance metrics and...
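One simple budget-aware comparison in this spirit (an illustrative sketch, not the paper's exact framework; the rule that counts over-budget queries as failures is an assumption):

```python
def compare_at_budget(results, budget):
    """Compare reasoning strategies at a fixed compute budget (sketch).
    `results` maps strategy -> list of (tokens_used, correct) per query;
    a strategy's score is its accuracy counting any query whose token
    cost exceeds the budget as wrong.
    """
    scores = {}
    for strategy, runs in results.items():
        hits = sum(1 for tokens, correct in runs if tokens <= budget and correct)
        scores[strategy] = hits / len(runs)
    return scores

# Made-up numbers: self-consistency samples many chains, so it costs more.
results = {
    "chain_of_thought": [(350, True), (420, True), (500, False)],
    "self_consistency": [(3500, True), (4200, True), (5000, True)],
}
print(compare_at_budget(results, budget=1000))   # CoT wins under a tight budget
print(compare_at_budget(results, budget=10000))  # self-consistency wins when compute is cheap
```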
In this study, we address the issue of API hallucinations in various software engineering contexts. We introduce CloudAPIBench, a new benchmark designed to measure API hallucination occurrences. CloudAPIBench also provides annotations for the frequencies of API occurrences in the public domain, allowing us to study API hallucinations at various frequency levels. Our findings reveal that Code LLMs struggle with low frequency APIs: e.g., GPT-4o achieves only 38.58% valid low frequency API invocations. We demonstrate that Documentation Augmented Generation (DAG) significantly...
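Reduced to its essentials, DAG prepends relevant API documentation to the generation prompt. The sketch below uses naive token-overlap retrieval and made-up doc strings purely for illustration; a real system would use an embedding-based retriever over indexed SDK documentation:

```python
def documentation_augmented_prompt(task, api_docs, top_k=2):
    """Documentation Augmented Generation, minimally sketched: rank API doc
    entries by how many terms they share with the task description and
    prepend the top entries to the code-generation prompt.
    """
    task_terms = set(task.lower().split())
    ranked = sorted(api_docs, key=lambda d: -len(task_terms & set(d.lower().split())))
    context = "\n".join(ranked[:top_k])
    return f"# Relevant API documentation:\n{context}\n\n# Task: {task}\n"

# Hypothetical doc snippets for illustration only.
docs = [
    "s3.create_bucket(Bucket=name) creates an S3 bucket.",
    "ec2.run_instances(ImageId=ami) launches EC2 instances.",
]
print(documentation_augmented_prompt("create an S3 bucket named logs", docs, top_k=1))
```

Grounding generation in retrieved documentation is what helps most for low frequency APIs, where the model's parametric knowledge is weakest.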
Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing code generation based on well-formed properties and aligning it with developer preferences remains challenging. In this paper, we explore two key questions under the new challenge of code preference learning: (i) How do we train models to predict meaningful preferences for code? and (ii) How do human and LLM preferences align with verifiable code properties and developer tastes? To this end, we propose CodeFavor, a framework for training pairwise code preference models from synthetic evolution data,...
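Pairwise preference learning of this kind typically reduces to a Bradley-Terry style objective over scalar scores for the preferred and rejected snippets; the sketch below is a generic version of that loss, not CodeFavor's actual objective:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry style pairwise loss: maximize the probability that the
    preferred code snippet scores above the rejected one. Generic sketch,
    not CodeFavor's actual training objective.
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage: in practice these scalars come from a model head over code pairs.
good = torch.tensor([2.0, 0.5])
bad = torch.tensor([1.0, 1.5])
print(pairwise_preference_loss(good, bad))  # penalizes the mis-ranked second pair
```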
Fill-in-the-Middle (FIM) has become integral to code language models, enabling the generation of missing code given both left and right contexts. However, the current FIM training paradigm, which reorders original training sequences and then performs regular next-token prediction (NTP), often leads to models struggling to generate content that aligns smoothly with the surrounding context. Crucially, while existing works rely on rule-based post-processing to circumvent this weakness, such methods are not practically usable in...
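For context, the reordering the abstract refers to is commonly the prefix-suffix-middle (PSM) transform sketched below; the sentinel strings are placeholder names, not this paper's actual special tokens:

```python
import random

def fim_psm_transform(tokens, rng=random):
    """Reorder a sequence into prefix-suffix-middle (PSM) format so that a
    plain next-token-prediction model learns to generate the middle span
    conditioned on both surrounding contexts. Sentinels are illustrative
    placeholders.
    """
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    # The model is trained with ordinary NTP on this reordered sequence,
    # so the `middle` tokens are predicted last, after seeing both contexts.
    return (["<FIM_PREFIX>"] + prefix
            + ["<FIM_SUFFIX>"] + suffix
            + ["<FIM_MIDDLE>"] + middle)

random.seed(0)
print(fim_psm_transform("def add ( a , b ) : return a + b".split()))
```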