- Software Engineering Research
- Advanced Malware Detection Techniques
- Software Reliability and Analysis Research
- Software System Performance and Reliability
- Software Testing and Debugging Techniques
- Topic Modeling
- Scientific Computing and Data Management
- Advanced Text Analysis Techniques
- Computational Physics and Python Applications
- Service-Oriented Architecture and Web Services
- Advanced Software Engineering Methodologies
- Advanced Neural Network Applications
- Forest Insect Ecology and Management
- Insect-Plant Interactions and Control
- Evolutionary Algorithms and Applications
- Machine Learning and Algorithms
- Machine Learning and ELM
- Research on Scale Insects
- Business Process Modeling and Analysis
- Stochastic Gradient Optimization Techniques
- Web Application Security Vulnerabilities
- Domain Adaptation and Few-Shot Learning
- Machine Learning and Data Classification
- Software Engineering Techniques and Practices
IBM Research - Zurich
2024
IBM Research - Thomas J. Watson Research Center
2024
IBM (United States)
2021
Static analysis tools are widely used for vulnerability detection as they understand programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose D2A, a...
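In the literature, D2A is built via differential analysis over bug-fixing commits. Below is a minimal sketch of that labeling idea, not the project's actual pipeline; the analyzer findings and fingerprints are illustrative. Findings that disappear after a fix are labeled likely true positives, and findings that persist are labeled likely false positives.

```python
# Illustrative differential-analysis labeling (not D2A's actual pipeline):
# compare static-analyzer findings before and after a bug-fixing commit.

def label_findings(before, after):
    """Label each pre-fix finding: 1 = likely true positive (fixed away),
    0 = likely false positive (still reported after the fix)."""
    after = set(after)
    return {finding: int(finding not in after) for finding in before}

# Findings are fingerprinted as (issue type, file, trace hash) -- hypothetical.
before = {("BUFFER_OVERRUN", "src/parse.c", "a1f3"),
          ("NULL_DEREF", "src/io.c", "9c2e")}
after = {("NULL_DEREF", "src/io.c", "9c2e")}

print(label_findings(before, after))
```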
Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and the code infrastructure of enterprise applications ages, it is now more critical than ever to increase software development productivity and modernize legacy applications. Advances in deep learning and machine learning algorithms have enabled numerous breakthroughs, motivating researchers to leverage AI techniques to improve software development efficiency. Thus, the fast-emerging research area of AI for Code has garnered new interest and gathered momentum. In...
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks. Present approaches to code analysis depend heavily on features derived from the Abstract Syntax Tree (AST), while our transformer-based language models work on raw source code. This work is the first to investigate whether such language models can discover AST features automatically. To achieve this, we introduce a sequence labeling task...
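As a small illustration of how ground-truth labels for such a probing task can be obtained, here is a sketch using Python's built-in ast module; the labeling scheme is illustrative, and the paper's exact task design may differ.

```python
import ast

code = "x = foo(1)"

# Parse once to obtain ground-truth AST node types and source offsets;
# these become per-token labels, while the probed model sees only raw code.
for node in ast.walk(ast.parse(code)):
    if hasattr(node, "col_offset"):  # skip container nodes like Module
        print(type(node).__name__, "at column", node.col_offset)
```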
Yangruibo Ding, Luca Buratti, Saurabh Pujar, Alessandro Morari, Baishakhi Ray, Saikat Chakraborty. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
Static analysis tools are widely used for vulnerability detection as they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to learn from programming language data opens new possibilities for reducing false positives when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, size, synthetic...
The recent improvement in code generation capabilities due to the use of large language models has mainly benefited general purpose programming languages. Domain specific languages, such as the ones used for IT Automation, have received far less attention, despite involving many active developers and being an essential component of modern cloud platforms. This work focuses on Ansible YAML, a widely used markup language for IT Automation. We present Wisdom, a natural-language to YAML code generation tool, aimed at improving automation...
Code Large Language Models (Code LLMs) have emerged as powerful tools, revolutionizing the software development landscape by automating the coding process and reducing the time and effort required to build applications. This paper focuses on training Code LLMs to specialize in the field of quantum computing. We begin by discussing the unique needs of quantum computing programming, which differ significantly from classical programming approaches or languages. A Code LLM specializing in quantum computing requires a foundational understanding of quantum information...
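For context, this is the flavor of program such a specialized model must learn to emit; the snippet uses the real Qiskit API, though the task itself is just an illustrative example.

```python
from qiskit import QuantumCircuit

# Prepare and measure a Bell state: a canonical beginner task in quantum
# programming that has no counterpart in classical languages.
qc = QuantumCircuit(2)
qc.h(0)       # put qubit 0 into superposition
qc.cx(0, 1)   # entangle qubit 0 with qubit 1
qc.measure_all()
print(qc.draw())
```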
Adam is a popular stochastic optimizer that uses adaptive estimates of lower-order moments to update weights and requires little hyper-parameter tuning. Some recent studies have called the generalization and out-of-sample behavior of such adaptive gradient methods into question, and argued that they are of only marginal value. Notably, for many well-known image classification tasks such as CIFAR-10 and ImageNet-1K, the current models with the best validation performance are still trained with SGD and a manual schedule of learning rate reduction. We analyze 7...
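For reference, here is a compact NumPy sketch of the Adam update rule (Kingma and Ba) with its standard default hyper-parameters; the variable names are ours, not the paper's.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using adaptive estimates of the first two moments."""
    m = beta1 * m + (1 - beta1) * g        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2   # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```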
Quantum programs are typically developed using quantum Software Development Kits (SDKs). The rapid advancement of quantum computing necessitates new tools to streamline this development process, and one such tool could be Generative Artificial Intelligence (GenAI). In this study, we introduce and use the Qiskit HumanEval dataset, a hand-curated collection of tasks designed to benchmark the ability of Large Language Models (LLMs) to produce quantum code using Qiskit - a quantum SDK. This dataset consists of more than 100 tasks, each accompanied by a prompt,...
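A hedged sketch of how a benchmark entry of this kind can be scored for functional correctness follows; the field names, task, and checking code are illustrative, not the dataset's actual schema.

```python
# Illustrative Qiskit-HumanEval-style record: a prompt, plus unit tests
# that the model's completion must pass (schema is hypothetical).
task = {
    "prompt": (
        "def bell_circuit():\n"
        "    \"\"\"Return a 2-qubit QuantumCircuit preparing a Bell state.\"\"\"\n"
    ),
    "test": "qc = bell_circuit()\nassert qc.num_qubits == 2",
}

model_completion = (
    "    from qiskit import QuantumCircuit\n"
    "    qc = QuantumCircuit(2)\n"
    "    qc.h(0)\n"
    "    qc.cx(0, 1)\n"
    "    return qc\n"
)

# Functional-correctness check: run prompt + completion, then the tests.
ns = {}
exec(task["prompt"] + model_completion, ns)
exec(task["test"], ns)  # raises AssertionError if the solution is wrong
print("task passed")
```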
The availability of Large Language Models (LLMs) which can generate code has made it possible to create tools that improve developer productivity. Integrated development environments, or IDEs, which developers use to write software, are often used as an interface to interact with LLMs. Although many such tools have been released, almost all of them focus on general-purpose programming languages. Domain-specific languages, such as those crucial for IT automation, have not received much attention. Ansible is one such YAML-based...
The complexity and scale of modern software programs often lead to overlooked programming errors and security vulnerabilities. Developers rely on automatic tools, like static analysis tools, to look for bugs. Static analysis tools are widely used because they can understand nontrivial program behaviors, scale to millions of lines of code, and detect subtle bugs. However, they are known to generate an excess of false alarms which hinder their utilization, as it is counterproductive for developers to go through a long list of reported issues, only to find a few true...
The recent improvement in code generation capabilities due to the use of large language models has mainly benefited general purpose programming languages. Domain specific languages, such as the ones used for IT Automation, have received far less attention, despite involving many active developers and being an essential component of modern cloud platforms. This work focuses on Ansible-YAML, a widely used markup language for IT Automation. We present Ansible Wisdom, a natural-language to Ansible-YAML code generation tool, aimed at improving...
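Below is a minimal sketch of the kind of (prompt, completion) pair such a tool operates on, following the common Ansible convention of prompting with a task name; the sample is illustrative and not drawn from the Wisdom training data.

```python
# Illustrative natural-language-to-Ansible-YAML pair (not Wisdom data):
# the prompt is a task name, the completion is the module invocation.
prompt = "- name: Install the nginx package"
completion = (
    "  ansible.builtin.apt:\n"
    "    name: nginx\n"
    "    state: present\n"
)
print(prompt)
print(completion, end="")
```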
Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection. While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning. On one hand,...
Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying...
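One way such a round-trip check can be framed is sketched below; gen_spec and gen_code stand in for LLM calls, and the fixed entry-point name is a hypothetical convention, not the paper's interface.

```python
# Hedged sketch of a code -> specification -> code self-consistency check.
def self_consistent(code, gen_spec, gen_code, inputs, entry="solution"):
    spec = gen_spec(code)    # model writes a specification for its own code
    code2 = gen_code(spec)   # model re-implements its own specification
    ns1, ns2 = {}, {}
    exec(code, ns1)
    exec(code2, ns2)
    f1, f2 = ns1[entry], ns2[entry]
    # Agreement on sample inputs approximates preservation of semantics.
    return all(f1(x) == f2(x) for x in inputs)
```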
Large language models (LLMs) have become remarkably good at improving developer productivity for high-resource programming languages. These models use two kinds of data: large amounts of unlabeled code samples for pre-training and relatively smaller amounts of labeled code samples for fine-tuning or in-context learning. Unfortunately, many programming languages are low-resource, lacking labeled samples for most tasks and often even lacking unlabeled samples. Therefore, users of low-resource languages (e.g., legacy or new languages) miss out on the benefits of LLMs. Cross-lingual transfer uses data from a...
Understanding the functional (dis)-similarity of source code is significant for code modeling tasks such as software vulnerability and code clone detection. We present DISCO (DIS-similarity of COde), a novel self-supervised model focusing on identifying (dis)similar functionalities of source code. Different from existing works, our approach does not require a huge amount of randomly collected datasets. Rather, we design structure-guided code transformation algorithms to generate synthetic code clones and inject real-world security...
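A minimal sketch of structure-guided pair construction in this spirit (the two transformations are illustrative, not the paper's actual algorithms): an identifier rename yields a semantically equivalent clone for a positive pair, and a flipped operator injects a subtle bug for a hard negative.

```python
import re

original = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s"
)

# Positive example: rename an identifier; behavior is unchanged.
clone = re.sub(r"\bs\b", "acc", original)

# Hard negative: flip an operator to inject a subtle, realistic bug.
buggy = original.replace("s += x", "s -= x")

pairs = [(original, clone, 1), (original, buggy, 0)]  # 1 = similar, 0 = dissimilar
```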