- Topic Modeling
- Natural Language Processing Techniques
- Software Engineering Research
- Machine Learning and Data Classification
- Speech Recognition and Synthesis
- Text Readability and Simplification
- Organizational Management and Leadership
- Adversarial Robustness in Machine Learning
- Scientific Computing and Data Management
- Multi-Agent Systems and Negotiation
- Mathematics Education and Pedagogy
- Bacillus and Francisella bacterial research
- Anomaly Detection Techniques and Applications
- Computational and Text Analysis Methods
- Speech and dialogue systems
- Multimodal Machine Learning Applications
- Parallel Computing and Optimization Techniques
- Auction Theory and Applications
- Imbalanced Data Classification Techniques
- Interpreting and Communication in Healthcare
- Ethics and Social Impacts of AI
- Advanced Malware Detection Techniques
- Data-Driven Disease Surveillance
- Fractal and DNA sequence analysis
- Machine Learning in Healthcare
Indian Institute of Information Technology Vadodara
2024
Chandigarh University
2024
Amazon (United States)
2019-2023
Indian Institute of Technology Jammu
2023
John Brown University
2023
Amazon (Germany)
2021
Tata Consultancy Services (India)
2014-2015
Luoxin Chen, Francisco Garcia, Varun Kumar, He Xie, Jianhua Lu. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. 2021.
Yang Cao, Yada Pruksachatkun, Kai-Wei Chang, Rahul Gupta, Varun Kumar, Jwala Dhamala, Aram Galstyan. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2022.
Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, Bing Xiang. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
Scientific machine learning (SciML) has advanced recently across many different areas in computational science and engineering. The objective is to integrate data and physics seamlessly, without the need of employing elaborate and computationally taxing data assimilation schemes. However, preprocessing, problem formulation, code generation, postprocessing, and analysis are still time-consuming and may prevent SciML from wide applicability in industrial applications and in digital twin frameworks. Here, we integrate the various stages...
Umang Gupta, Jwala Dhamala, Varun Kumar, Apurv Verma, Yada Pruksachatkun, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Greg Ver Steeg, Aram Galstyan. Findings of the Association for Computational Linguistics: ACL 2022.
We present new benchmarks for the evaluation of code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. Using these benchmarks, we are able to assess the performance of code generation models in a multi-lingual fashion, and discovered the generalization ability of language models on out-of-domain languages, the advantages of multi-lingual models over mono-lingual ones,...
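As a toy illustration of what such transpilation involves (the function below handles a single Python assertion pattern and targets JavaScript; it is an assumed example, far narrower than the paper's actual conversion framework):

```python
import re

def python_assert_to_javascript(line):
    """Illustrative sketch: convert a simple Python test case of the form
    `assert f(args) == expected` into a JavaScript equivalent. The real
    conversion framework handles full prompts, many patterns, and many
    target languages; this handles one pattern only.
    """
    m = re.match(r"assert\s+(.+?)\s*==\s*(.+)", line.strip())
    if m is None:
        raise ValueError(f"unsupported test case: {line!r}")
    call, expected = m.groups()
    # JSON.stringify gives structural equality for arrays/objects as well.
    return f"console.assert(JSON.stringify({call}) === JSON.stringify({expected}));"

print(python_assert_to_javascript("assert add(1, 2) == 3"))
# console.assert(JSON.stringify(add(1, 2)) === JSON.stringify(3));
```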
ML-powered code generation aims to assist developers to write code in a more productive manner by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have pushed the boundary of code generation and achieved impressive performance. However, the huge number of model parameters poses a significant challenge to their adoption in a typical software development environment, where a developer might use a standard laptop or mid-size server to develop code. Such a cost of resources, in terms...
Expanding new functionalities efficiently is an ongoing challenge for single-turn task-oriented dialogue systems. In this work, we explore functionality-specific semi-supervised learning via self-training. We consider methods that augment training data automatically from unlabeled data sets in a functionality-targeted manner. In addition, we examine multiple techniques for the efficient selection of augmented utterances, to reduce training time and increase diversity. First, we consider paraphrase detection methods that attempt to find utterance...
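As a rough illustration of one such self-training round (a sketch only: the scikit-learn pipeline, function name, and confidence-threshold selection rule below are assumptions, not the paper's method):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def self_training_round(labeled, unlabeled_texts, target_label, threshold=0.9):
    """One functionality-targeted self-training round (illustrative sketch):
    train on labeled (utterance, label) pairs, pseudo-label the unlabeled
    pool, and keep only confident predictions for the target functionality.
    """
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    texts, labels = zip(*labeled)
    model.fit(texts, labels)
    augmented = list(labeled)
    for text in unlabeled_texts:
        probs = model.predict_proba([text])[0]
        pred = model.classes_[probs.argmax()]
        if pred == target_label and probs.max() >= threshold:
            augmented.append((text, pred))  # confident pseudo-label
    return augmented
```

The threshold trades off pseudo-label noise against coverage; the paraphrase-detection and diversity-based selection techniques the abstract mentions would replace or refine the simple confidence filter used here.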
Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski, Xiaopeng Li, Murali Krishna Ramanathan, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track). 2023.
Ninareh Mehrabi, Palash Goyal, Apurv Verma, Jwala Dhamala, Varun Kumar, Qian Hu, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Rahul Gupta. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
Multiple metrics have been introduced to measure fairness in various natural language processing tasks. These metrics can be roughly categorized into two categories: 1) *extrinsic metrics* for evaluating fairness in downstream applications and 2) *intrinsic metrics* for estimating fairness in upstream contextualized language representation models. In this paper, we conduct an extensive correlation study between intrinsic and extrinsic metrics across bias notions using 19 contextualized language models. We find that intrinsic and extrinsic metrics do not necessarily correlate in their original setting, even when...
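Mechanically, such a correlation study boils down to correlating per-model scores of an intrinsic metric with those of an extrinsic one; the scores below are fabricated solely to show the computation:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-model scores for one intrinsic and one extrinsic metric,
# measured on the same set of models (numbers are made up for illustration).
intrinsic = [0.12, 0.30, 0.25, 0.40, 0.18]
extrinsic = [0.50, 0.45, 0.60, 0.41, 0.55]

r, p = pearsonr(intrinsic, extrinsic)
rho, p_s = spearmanr(intrinsic, extrinsic)
print(f"Pearson r={r:.2f} (p={p:.2f}); Spearman rho={rho:.2f} (p={p_s:.2f})")
```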
Natural language often contains ambiguities that can lead to misinterpretation and miscommunication. While humans can handle ambiguities effectively by asking clarifying questions and/or relying on contextual cues and common-sense knowledge, resolving ambiguities can be notoriously hard for machines. In this work, we study ambiguities that arise in text-to-image generative models. We curate a benchmark dataset covering different types of ambiguities that occur in these systems. We then propose a framework to mitigate ambiguities in the prompts given to these systems by soliciting clarifications...
The relevance of a concept being taught to the real world is believed to contribute to an increase in intrinsic motivation and engagement of the learner. Such relevance is often found lacking in learning material such as textbooks. Practical issues and problems one could face while applying or implementing new concepts are one means of establishing relevance. In this paper, we propose a method to automatically augment textbooks with practical questions about the concepts being learnt. We use answers from StackOverflow, a leading social Questions & Answers (Q&A) website...
The incorporation of cryptographic techniques is crucial for guaranteeing the privacy and security of data processed and transmitted inside IoT ecosystems, particularly as the IoT keeps growing. Examining problems including resource limitations, scalability, and the dynamic nature of IoT environments, this research paper explores the complex obstacles that cryptographic solutions confront in the IoT. Lightweight cryptography, post-quantum cryptography, and blockchain integration are some of the new trends and future prospects examined in the study, in an effort to...
In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient...
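The packing step can be illustrated with best-fit-decreasing bin packing over document token counts; this is a generic sketch of the underlying algorithm, not the paper's implementation, and it assumes over-length documents are chunked beforehand:

```python
def best_fit_packing(doc_lengths, max_len):
    """Pack documents (given as token counts) into training sequences of
    capacity `max_len` without splitting any document: each document goes
    into the fullest bin that can still hold it (best-fit decreasing).
    Illustrative sketch; documents longer than `max_len` are assumed to
    be chunked before packing.
    """
    bins = []  # each bin: [remaining_capacity, [doc_indices]]
    for i in sorted(range(len(doc_lengths)), key=lambda i: -doc_lengths[i]):
        need = doc_lengths[i]
        # Choose the bin with the least remaining space that still fits.
        best = None
        for b in bins:
            if b[0] >= need and (best is None or b[0] < best[0]):
                best = b
        if best is None:  # nothing fits: open a new training sequence
            best = [max_len, []]
            bins.append(best)
        best[0] -= need
        best[1].append(i)
    return [b[1] for b in bins]

# Pack six documents into 2048-token training sequences.
print(best_fit_packing([1500, 600, 2048, 100, 900, 400], max_len=2048))
# -> [[2], [0, 5, 3], [4, 1]]  (no document is truncated)
```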
A diverse array of reasoning strategies has been proposed to elicit the capabilities of large language models. However, in this paper, we point out that traditional evaluations which focus solely on performance metrics miss a key factor: the increased effectiveness due to additional compute. By overlooking this aspect, a skewed view of strategy efficiency is often presented. This paper introduces a framework that incorporates the compute budget into the evaluation, providing a more informative comparison that takes into account both performance metrics and...
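One simple budget-aware comparison in this spirit (an illustrative sketch, not the paper's exact framework; the rule that counts over-budget queries as failures is an assumption):

```python
def compare_at_budget(results, budget):
    """Compare reasoning strategies at a fixed compute budget (sketch).
    `results` maps strategy -> list of (tokens_used, correct) per query;
    a strategy's score is its accuracy counting any query whose token
    cost exceeds the budget as wrong.
    """
    scores = {}
    for strategy, runs in results.items():
        hits = sum(1 for tokens, correct in runs if tokens <= budget and correct)
        scores[strategy] = hits / len(runs)
    return scores

# Made-up numbers: self-consistency samples many chains, so it costs more.
results = {
    "chain_of_thought": [(350, True), (420, True), (500, False)],
    "self_consistency": [(3500, True), (4200, True), (5000, True)],
}
print(compare_at_budget(results, budget=1000))   # CoT wins under a tight budget
print(compare_at_budget(results, budget=10000))  # self-consistency wins when compute is cheap
```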
In this study, we address the issue of API hallucinations in various software engineering contexts. We introduce CloudAPIBench, a new benchmark designed to measure API hallucination occurrences. CloudAPIBench also provides annotations for the frequencies of API occurrences in the public domain, allowing us to study API hallucinations at various frequency levels. Our findings reveal that Code LLMs struggle with low frequency APIs: e.g., GPT-4o achieves only 38.58% valid low frequency API invocations. We demonstrate that Documentation Augmented Generation (DAG) significantly...
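Reduced to its essentials, DAG prepends relevant API documentation to the generation prompt. The sketch below uses naive token-overlap retrieval and made-up doc strings purely for illustration; a real system would use an embedding-based retriever over indexed SDK documentation:

```python
def documentation_augmented_prompt(task, api_docs, top_k=2):
    """Documentation Augmented Generation, minimally sketched: rank API doc
    entries by how many terms they share with the task description and
    prepend the top entries to the code-generation prompt.
    """
    task_terms = set(task.lower().split())
    ranked = sorted(api_docs, key=lambda d: -len(task_terms & set(d.lower().split())))
    context = "\n".join(ranked[:top_k])
    return f"# Relevant API documentation:\n{context}\n\n# Task: {task}\n"

# Hypothetical doc snippets for illustration only.
docs = [
    "s3.create_bucket(Bucket=name) creates an S3 bucket.",
    "ec2.run_instances(ImageId=ami) launches EC2 instances.",
]
print(documentation_augmented_prompt("create an S3 bucket named logs", docs, top_k=1))
```

Grounding generation in retrieved documentation is what helps most for low frequency APIs, where the model's parametric knowledge is weakest.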
Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing code generation based on well-formed properties and aligning it with developer preferences remains challenging. In this paper, we explore two key questions under the new challenge of code preference learning: (i) How do we train models to predict meaningful preferences for code? and (ii) How do human and LLM preferences align with verifiable code properties and developer tastes? To this end, we propose CodeFavor, a framework for training pairwise code preference models from synthetic evolution data,...
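Pairwise preference learning of this kind typically reduces to a Bradley-Terry style objective over scalar scores for the preferred and rejected snippets; the sketch below is a generic version of that loss, not CodeFavor's actual objective:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry style pairwise loss: maximize the probability that the
    preferred code snippet scores above the rejected one. Generic sketch,
    not CodeFavor's actual training objective.
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage: in practice these scalars come from a model head over code pairs.
good = torch.tensor([2.0, 0.5])
bad = torch.tensor([1.0, 1.5])
print(pairwise_preference_loss(good, bad))  # penalizes the mis-ranked second pair
```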
Fill-in-the-Middle (FIM) has become integral to code language models, enabling the generation of missing code given both left and right contexts. However, the current FIM training paradigm, which reorders original training sequences and then performs regular next-token prediction (NTP), often leads to models struggling to generate content that aligns smoothly with the surrounding context. Crucially, while existing works rely on rule-based post-processing to circumvent this weakness, such methods are not practically usable in...
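For context, the reordering the abstract refers to is commonly the prefix-suffix-middle (PSM) transform sketched below; the sentinel strings are placeholder names, not this paper's actual special tokens:

```python
import random

def fim_psm_transform(tokens, rng=random):
    """Reorder a sequence into prefix-suffix-middle (PSM) format so that a
    plain next-token-prediction model learns to generate the middle span
    conditioned on both surrounding contexts. Sentinels are illustrative
    placeholders.
    """
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    # The model is trained with ordinary NTP on this reordered sequence,
    # so the `middle` tokens are predicted last, after seeing both contexts.
    return (["<FIM_PREFIX>"] + prefix
            + ["<FIM_SUFFIX>"] + suffix
            + ["<FIM_MIDDLE>"] + middle)

random.seed(0)
print(fim_psm_transform("def add ( a , b ) : return a + b".split()))
```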