Varun Kumar

ORCID: 0009-0000-3961-8439
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Software Engineering Research
  • Machine Learning and Data Classification
  • Speech Recognition and Synthesis
  • Text Readability and Simplification
  • Organizational Management and Leadership
  • Adversarial Robustness in Machine Learning
  • Scientific Computing and Data Management
  • Multi-Agent Systems and Negotiation
  • Mathematics Education and Pedagogy
  • Bacillus and Francisella bacterial research
  • Anomaly Detection Techniques and Applications
  • Computational and Text Analysis Methods
  • Speech and dialogue systems
  • Multimodal Machine Learning Applications
  • Parallel Computing and Optimization Techniques
  • Auction Theory and Applications
  • Imbalanced Data Classification Techniques
  • Interpreting and Communication in Healthcare
  • Ethics and Social Impacts of AI
  • Advanced Malware Detection Techniques
  • Data-Driven Disease Surveillance
  • Fractal and DNA sequence analysis
  • Machine Learning in Healthcare

Affiliations

Indian Institute of Information Technology Vadodara
2024

Chandigarh University
2024

Amazon (United States)
2019-2023

Indian Institute of Technology Jammu
2023

John Brown University
2023

Amazon (Germany)
2021

Tata Consultancy Services (India)
2014-2015

Publications

Luoxin Chen, Francisco Garcia, Varun Kumar, He Xie, Jianhua Lu. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. 2021.

10.18653/v1/2021.naacl-industry.39 article EN cc-by 2021-01-01

Yang Cao, Yada Pruksachatkun, Kai-Wei Chang, Rahul Gupta, Varun Kumar, Jwala Dhamala, Aram Galstyan. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2022.

10.18653/v1/2022.acl-short.62 article EN cc-by 2022-01-01

Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, Bing Xiang. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.

10.18653/v1/2023.acl-long.773 article EN cc-by 2023-01-01

Scientific machine learning (SciML) has advanced recently across many different areas in computational science and engineering. The objective is to integrate data and physics seamlessly, without the need of employing elaborate and computationally taxing data assimilation schemes. However, preprocessing, problem formulation, code generation, postprocessing, and analysis are still time-consuming and may prevent SciML from wide applicability in industrial applications and digital twin frameworks. Here, we address the various stages...

10.1615/jmachlearnmodelcomput.2023049518 article EN Journal of Machine Learning for Modeling and Computing 2023-01-01
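The data-physics integration the abstract refers to can be made concrete with a toy objective. Below is a minimal Python sketch, not from the paper, that blends a data-fit term with the residual of an assumed governing equation (the toy ODE du/dx = u); all names and numbers are illustrative.

```python
import numpy as np

# Toy data: noisy observations of u(x) = exp(x), the solution of du/dx = u.
x = np.linspace(0.0, 1.0, 50)
rng = np.random.default_rng(0)
u_obs = np.exp(x) + 0.01 * rng.normal(size=x.size)

def sciml_loss(u_pred: np.ndarray) -> float:
    data_loss = np.mean((u_pred - u_obs) ** 2)       # fit the observations
    dudx = np.gradient(u_pred, x)                    # finite-difference derivative
    physics_loss = np.mean((dudx - u_pred) ** 2)     # residual of du/dx = u
    return data_loss + physics_loss                  # single blended objective

print(sciml_loss(np.exp(x)))  # near zero: consistent with both data and physics
```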

Umang Gupta, Jwala Dhamala, Varun Kumar, Apurv Verma, Yada Pruksachatkun, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Greg Ver Steeg, Aram Galstyan. Findings of the Association for Computational Linguistics: ACL 2022.

10.18653/v1/2022.findings-acl.55 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2022 2022-01-01

We present new benchmarks for the evaluation of code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. Using these benchmarks, we are able to assess the performance of models in a multi-lingual fashion, and discovered generalization ability of language models on out-of-domain languages, advantages of multi-lingual models over mono-lingual...

10.48550/arxiv.2210.14868 preprint EN cc-by arXiv (Cornell University) 2022-01-01
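As a rough illustration of the transpilation idea described above (not the paper's actual framework), the sketch below maps a Python-style assertion test case into equivalent test statements in a couple of target languages; the function name and values are hypothetical.

```python
# Hypothetical converter: turn one Python-style test case into the target
# language's assertion syntax. The real framework also converts prompts,
# signatures, and types across 10+ languages.
def convert_test(func: str, args: str, expected: str, lang: str) -> str:
    templates = {
        "python": f"assert {func}({args}) == {expected}",
        "javascript": f"console.assert({func}({args}) === {expected});",
        "java": f"assert {func}({args}) == {expected};",
    }
    return templates[lang]

for lang in ("python", "javascript", "java"):
    print(convert_test("addTwo", "3, 4", "7", lang))
```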

ML-powered code generation aims to assist developers to write code in a more productive manner by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have pushed the boundary of code generation and achieved impressive performance. However, the huge number of model parameters poses a significant challenge to their adoption in a typical software development environment, where a developer might use a standard laptop or mid-size server to develop code. Such a cost of resources in terms...

10.1145/3611643.3616302 article EN 2023-11-30

Expanding new functionalities efficiently is an ongoing challenge for single-turn task-oriented dialogue systems. In this work, we explore functionality-specific semi-supervised learning via self-training. We consider methods that augment training data automatically from unlabeled data sets in a functionality-targeted manner. In addition, we examine multiple techniques for efficient selection of augmented utterances to reduce training time and increase diversity. First, we use paraphrase detection to attempt to find utterance...

10.1109/asru46091.2019.9003747 article EN 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019-12-01
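The self-training loop described above can be sketched as follows. This is a generic confidence-filtered pseudo-labeling sketch in scikit-learn, not the paper's exact pipeline; the utterances, intent labels, and 0.6 threshold are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled = [("play some jazz", "PlayMusic"), ("queue up a rock song", "PlayMusic"),
           ("turn off the lamp", "SmartHome"), ("dim the bedroom lights", "SmartHome")]
unlabeled = ["put on some classical music", "switch the lights off"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
texts, intents = zip(*labeled)
model.fit(texts, intents)

# Pseudo-label unlabeled utterances; keep only confident predictions.
probs = model.predict_proba(unlabeled)
preds = model.predict(unlabeled)
augmented = [(u, y) for u, y, p in zip(unlabeled, preds, probs) if p.max() >= 0.6]

# Retrain on the union of labeled and confidently pseudo-labeled data.
texts2, intents2 = zip(*(labeled + augmented))
model.fit(texts2, intents2)
```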

Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski, Xiaopeng Li, Murali Krishna Ramanathan, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track). 2023.

10.18653/v1/2023.acl-industry.34 article EN cc-by 2023-01-01

Ninareh Mehrabi, Palash Goyal, Apurv Verma, Jwala Dhamala, Varun Kumar, Qian Hu, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Rahul Gupta. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.

10.18653/v1/2023.acl-long.804 article EN cc-by 2023-01-01

Multiple metrics have been introduced to measure fairness in various natural language processing tasks. These metrics can be roughly categorized into two categories: 1) extrinsic metrics for evaluating fairness in downstream applications and 2) intrinsic metrics for estimating fairness in upstream contextualized language representation models. In this paper, we conduct an extensive correlation study between intrinsic and extrinsic metrics across bias notions using 19 contextualized language models. We find that intrinsic and extrinsic metrics do not necessarily correlate in their original setting, even when...

10.48550/arxiv.2203.13928 preprint EN other-oa arXiv (Cornell University) 2022-01-01
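The core analysis is a correlation between the two metric families across models. A minimal sketch, with made-up scores standing in for the paper's measurements:

```python
from scipy.stats import pearsonr

# One intrinsic and one extrinsic bias score per model (values made up).
intrinsic = [0.12, 0.30, 0.25, 0.40, 0.18]  # e.g., representation-level bias
extrinsic = [0.08, 0.15, 0.29, 0.22, 0.11]  # e.g., downstream fairness gap

r, p = pearsonr(intrinsic, extrinsic)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
# A weak r means the two metric families disagree, which is exactly the
# kind of mismatch the study investigates.
```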

Natural language often contains ambiguities that can lead to misinterpretation and miscommunication. While humans can handle ambiguities effectively by asking clarifying questions and/or relying on contextual cues and common-sense knowledge, resolving ambiguities can be notoriously hard for machines. In this work, we study ambiguities that arise in text-to-image generative models. We curate a benchmark dataset covering different types of ambiguities that occur in these systems. We then propose a framework to mitigate ambiguities in the prompts given to the systems by soliciting clarifications...

10.48550/arxiv.2211.12503 preprint EN cc-by arXiv (Cornell University) 2022-01-01
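In the spirit of the clarification framework above, and only as a hypothetical stand-in for it, here is a rule-based sketch that checks a prompt for known ambiguous words and asks a clarifying question before the prompt reaches the image model:

```python
# Hypothetical ambiguity lexicon; the paper's benchmark covers richer
# ambiguity types than single ambiguous nouns.
AMBIGUOUS = {
    "crane": ["the bird", "the construction machine"],
    "bat": ["the animal", "the sports equipment"],
}

def clarify(prompt: str) -> str | None:
    for word, senses in AMBIGUOUS.items():
        if word in prompt.lower().split():
            return f"By '{word}', do you mean {senses[0]} or {senses[1]}?"
    return None  # no known ambiguity: forward the prompt unchanged

print(clarify("a crane standing in a river"))
# By 'crane', do you mean the bird or the construction machine?
```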

Relevance of a concept being taught to the real world is believed to contribute to an increase in the intrinsic motivation and engagement of the learner. Such relevance is often found lacking in learning material such as textbooks. Practical issues and problems one could face while learning or implementing new concepts are one means of establishing relevance. In this paper, we propose a method to automatically augment learning material with practical questions about the concepts learnt. We use answers from StackOverflow, a leading social Questions & Answers (Q&A) website...

10.1109/fie.2015.7344369 article EN 2015 IEEE Frontiers in Education Conference (FIE) 2015-10-01

The incorporation of cryptographic techniques is crucial for guaranteeing the privacy and security of data processed and transmitted inside IoT ecosystems, particularly as the IoT keeps growing. Examining problems including resource limitations, scalability, and the dynamic nature of IoT environments, this research paper explores the complex obstacles that cryptographic solutions confront in the IoT. Lightweight cryptography, post-quantum cryptography, and blockchain integration are some of the new trends and future prospects examined in the study in an effort to...

10.55041/ijsrem30505 article EN INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 2024-04-10

In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient...

10.48550/arxiv.2404.10830 preprint EN arXiv (Cornell University) 2024-04-16
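Best-fit packing is a classic bin-packing heuristic applied here to document chunks. The sketch below shows the core placement rule under the assumption that documents are first split into chunks no longer than the sequence length; it is not the paper's implementation.

```python
def best_fit_pack(doc_lengths: list[int], max_len: int) -> list[list[int]]:
    """Place each document in the bin whose remaining space fits it most
    tightly, opening a new bin when none fits; no document is ever split."""
    bins = []  # each bin: [remaining_capacity, [doc indices]]
    for i in sorted(range(len(doc_lengths)), key=lambda i: -doc_lengths[i]):
        fitting = [b for b in bins if b[0] >= doc_lengths[i]]
        if fitting:
            best = min(fitting, key=lambda b: b[0])  # tightest remaining space
            best[0] -= doc_lengths[i]
            best[1].append(i)
        else:
            bins.append([max_len - doc_lengths[i], [i]])
    return [docs for _, docs in bins]

print(best_fit_pack([1800, 900, 700, 600, 300], max_len=2048))
# [[0], [1, 2, 4], [3]] -- contrast with naive concatenate-and-split,
# which would cut documents at arbitrary 2048-token boundaries.
```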

A diverse array of reasoning strategies has been proposed to elicit the capabilities of large language models. However, in this paper, we point out that traditional evaluations which focus solely on performance metrics miss a key factor: the increased effectiveness due to additional compute. By overlooking this aspect, a skewed view of strategy efficiency is often presented. This paper introduces a framework that incorporates the compute budget into the evaluation, providing a more informative comparison that takes into account both performance and...

10.48550/arxiv.2406.06461 preprint EN arXiv (Cornell University) 2024-06-10
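The point of such a framework is to compare strategies at matched compute rather than on raw accuracy alone. A minimal sketch, with run logs fabricated purely for illustration and generated tokens as a proxy for compute:

```python
# Made-up evaluation logs: each entry is one question attempt.
runs = [
    {"strategy": "chain-of-thought", "tokens": 180, "correct": True},
    {"strategy": "chain-of-thought", "tokens": 210, "correct": False},
    {"strategy": "self-consistency", "tokens": 900, "correct": True},
    {"strategy": "self-consistency", "tokens": 950, "correct": True},
]

def accuracy_and_cost(runs, strategy):
    rs = [r for r in runs if r["strategy"] == strategy]
    accuracy = sum(r["correct"] for r in rs) / len(rs)
    avg_tokens = sum(r["tokens"] for r in rs) / len(rs)
    return accuracy, avg_tokens

for s in ("chain-of-thought", "self-consistency"):
    acc, tok = accuracy_and_cost(runs, s)
    print(f"{s}: accuracy={acc:.2f} at ~{tok:.0f} tokens/question")
```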

In this study, we address the issue of API hallucinations in various software engineering contexts. We introduce CloudAPIBench, a new benchmark designed to measure API hallucination occurrences. CloudAPIBench also provides annotations for frequencies of API occurrences in the public domain, allowing us to study API hallucinations at various frequency levels. Our findings reveal that Code LLMs struggle with low frequency APIs: e.g., GPT-4o achieves only 38.58% valid invocations. We demonstrate that Documentation Augmented Generation (DAG) significantly...

10.48550/arxiv.2407.09726 preprint EN arXiv (Cornell University) 2024-07-12
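Documentation Augmented Generation, as described, prepends retrieved API documentation to the code-generation prompt. A toy sketch, with a hypothetical doc store and naive keyword retrieval standing in for whatever retriever the paper uses:

```python
# Hypothetical documentation store keyed by API name.
API_DOCS = {
    "s3.create_bucket": "create_bucket(Bucket=...): creates an S3 bucket.",
    "s3.put_object": "put_object(Bucket=..., Key=..., Body=...): uploads an object.",
}

def augment_prompt(task: str) -> str:
    # Naive keyword retrieval: keep docs whose API name overlaps the task.
    hits = [doc for name, doc in API_DOCS.items()
            if any(part in task.lower() for part in name.split("."))]
    doc_block = "\n".join(f"# {doc}" for doc in hits)
    return f"# Relevant API documentation:\n{doc_block}\n# Task: {task}\n"

print(augment_prompt("upload a file to s3"))
```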

Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing code generation based on well-formed properties and aligning it with developer preferences remains challenging. In this paper, we explore two key questions under the new challenge of code preference learning: (i) How do we train models to predict meaningful preferences for code? and (ii) How do human and LLM preferences align with verifiable code properties and developer tastes? To this end, we propose CodeFavor, a framework for training pairwise code preference models from synthetic evolution data,...

10.48550/arxiv.2410.03837 preprint EN arXiv (Cornell University) 2024-10-04
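Pairwise preference training of the kind the abstract names typically optimizes a Bradley-Terry style objective: the scorer should rank the chosen code sample above the rejected one. A minimal sketch of that loss, offered as an assumption about the setup rather than CodeFavor's exact recipe:

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """-log sigmoid(s_chosen - s_rejected): near 0 when the preferred
    sample scores higher, large when the ranking is violated."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(pairwise_preference_loss(2.0, 0.5))  # ~0.20: preference respected
print(pairwise_preference_loss(0.5, 2.0))  # ~1.70: preference violated
```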

Fill-in-the-Middle (FIM) has become integral to code language models, enabling generation of missing code given both left and right contexts. However, the current FIM training paradigm, which reorders original training sequences and then performs regular next-token prediction (NTP), often leads to models struggling to generate content that aligns smoothly with the surrounding context. Crucially, while existing works rely on rule-based post-processing to circumvent this weakness, such methods are not practically usable in...

10.48550/arxiv.2410.03103 preprint EN arXiv (Cornell University) 2024-10-03
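The "reorder then next-token-predict" paradigm the abstract critiques is the standard FIM data transformation: split a document at two random points and rearrange it as prefix-suffix-middle (PSM) with sentinel tokens. A minimal sketch, where the sentinel strings are illustrative placeholders rather than any specific tokenizer's:

```python
import random

# Placeholder sentinel tokens marking the three reordered segments.
PRE, SUF, MID = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def to_fim_psm(doc: str, rng: random.Random) -> str:
    # Pick two cut points and reorder: prefix, suffix, then middle.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # Trained with plain NTP, the model learns to emit `middle`
    # after having seen both the prefix and the suffix.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

rng = random.Random(0)
print(to_fim_psm("def add(a, b):\n    return a + b\n", rng))
```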

10.18653/v1/2024.emnlp-main.1112 article EN Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024-01-01

10.1109/tencon61640.2024.10903085 article EN TENCON 2024 - 2024 IEEE Region 10 Conference (TENCON) 2024-12-01