- Software Engineering Research
- Advanced Malware Detection Techniques
- Software Reliability and Analysis Research
- Software System Performance and Reliability
- Software Testing and Debugging Techniques
- Topic Modeling
- Scientific Computing and Data Management
- Advanced Text Analysis Techniques
- Computational Physics and Python Applications
- Service-Oriented Architecture and Web Services
- Advanced Software Engineering Methodologies
- Advanced Neural Network Applications
- Forest Insect Ecology and Management
- Insect-Plant Interactions and Control
- Evolutionary Algorithms and Applications
- Machine Learning and Algorithms
- Machine Learning and ELM
- Research on Scale Insects
- Business Process Modeling and Analysis
- Stochastic Gradient Optimization Techniques
- Web Application Security Vulnerabilities
- Domain Adaptation and Few-Shot Learning
- Machine Learning and Data Classification
- Software Engineering Techniques and Practices
IBM Research - Zurich
2024
IBM Research - Thomas J. Watson Research Center
2024
IBM (United States)
2021
Static analysis tools are widely used for vulnerability detection as they understand programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose D2A, a...
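In the literature, D2A is built via differential analysis over bug-fixing commits. Below is a minimal sketch of that labeling idea, not the project's actual pipeline; the analyzer findings and fingerprints are illustrative. Findings that disappear after a fix are labeled likely true positives, and findings that persist are labeled likely false positives.

```python
# Illustrative differential-analysis labeling (not D2A's actual pipeline):
# compare static-analyzer findings before and after a bug-fixing commit.

def label_findings(before, after):
    """Label each pre-fix finding: 1 = likely true positive (fixed away),
    0 = likely false positive (still reported after the fix)."""
    after = set(after)
    return {finding: int(finding not in after) for finding in before}

# Findings are fingerprinted as (issue type, file, trace hash) -- hypothetical.
before = {("BUFFER_OVERRUN", "src/parse.c", "a1f3"),
          ("NULL_DEREF", "src/io.c", "9c2e")}
after = {("NULL_DEREF", "src/io.c", "9c2e")}

print(label_findings(before, after))
```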
Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and the code infrastructure of enterprise applications ages, it is now more critical than ever to increase software development productivity and modernize legacy applications. Advances in deep learning and machine learning algorithms have enabled numerous breakthroughs, motivating researchers to leverage AI techniques to improve software development efficiency. Thus, the fast-emerging research area of AI for Code has garnered new interest and gathered momentum. In...
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks. Present approaches to code analysis depend heavily on features derived from the Abstract Syntax Tree (AST), while our transformer-based language models work on raw source code. This work is the first to investigate whether such language models can discover AST features automatically. To achieve this, we introduce a sequence labeling task...
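As a small illustration of how ground-truth labels for such a probing task can be obtained, here is a sketch using Python's built-in ast module; the labeling scheme is illustrative, and the paper's exact task design may differ.

```python
import ast

code = "x = foo(1)"

# Parse once to obtain ground-truth AST node types and source offsets;
# these become per-token labels, while the probed model sees only raw code.
for node in ast.walk(ast.parse(code)):
    if hasattr(node, "col_offset"):  # skip container nodes like Module
        print(type(node).__name__, "at column", node.col_offset)
```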
Yangruibo Ding, Luca Buratti, Saurabh Pujar, Alessandro Morari, Baishakhi Ray, Saikat Chakraborty. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
Static analysis tools are widely used for vulnerability detection as they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to learn from programming language data opens new possibilities for reducing false positives when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, size, synthetic...
The recent improvement in code generation capabilities due to the use of large language models has mainly benefited general purpose programming languages. Domain specific languages, such as the ones used for IT Automation, have received far less attention, despite involving many active developers and being an essential component of modern cloud platforms. This work focuses on Ansible YAML, a widely used markup language for IT Automation. We present Wisdom, a natural-language to YAML code generation tool, aimed at improving automation...
Code Large Language Models (Code LLMs) have emerged as powerful tools, revolutionizing the software development landscape by automating the coding process and reducing the time and effort required to build applications. This paper focuses on training Code LLMs to specialize in the field of quantum computing. We begin by discussing the unique needs of quantum computing programming, which differ significantly from classical programming approaches or languages. A Code LLM specializing in quantum computing requires a foundational understanding of quantum information...
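For context, this is the flavor of program such a specialized model must learn to emit; the snippet uses the real Qiskit API, though the task itself is just an illustrative example.

```python
from qiskit import QuantumCircuit

# Prepare and measure a Bell state: a canonical beginner task in quantum
# programming that has no counterpart in classical languages.
qc = QuantumCircuit(2)
qc.h(0)       # put qubit 0 into superposition
qc.cx(0, 1)   # entangle qubit 0 with qubit 1
qc.measure_all()
print(qc.draw())
```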
Adam is a popular stochastic optimizer that uses adaptive estimates of lower-order moments to update weights and requires little hyper-parameter tuning. Some recent studies have called the generalization and out-of-sample behavior of such adaptive gradient methods into question, and argued that they are of only marginal value. Notably, for many well-known image classification tasks such as CIFAR-10 and ImageNet-1K, the current models with the best validation performance are still trained with SGD and a manual schedule of learning rate reduction. We analyze 7...
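For reference, here is a compact NumPy sketch of the Adam update rule (Kingma and Ba) with its standard default hyper-parameters; the variable names are ours, not the paper's.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using adaptive estimates of the first two moments."""
    m = beta1 * m + (1 - beta1) * g        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2   # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```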
Quantum programs are typically developed using quantum Software Development Kits (SDKs). The rapid advancement of quantum computing necessitates new tools to streamline this development process, and one such tool could be Generative Artificial Intelligence (GenAI). In this study, we introduce and use the Qiskit HumanEval dataset, a hand-curated collection of tasks designed to benchmark the ability of Large Language Models (LLMs) to produce quantum code using Qiskit - a quantum SDK. This dataset consists of more than 100 tasks, each accompanied by a prompt,...
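A hedged sketch of how a benchmark entry of this kind can be scored for functional correctness follows; the field names, task, and checking code are illustrative, not the dataset's actual schema.

```python
# Illustrative Qiskit-HumanEval-style record: a prompt, plus unit tests
# that the model's completion must pass (schema is hypothetical).
task = {
    "prompt": (
        "def bell_circuit():\n"
        "    \"\"\"Return a 2-qubit QuantumCircuit preparing a Bell state.\"\"\"\n"
    ),
    "test": "qc = bell_circuit()\nassert qc.num_qubits == 2",
}

model_completion = (
    "    from qiskit import QuantumCircuit\n"
    "    qc = QuantumCircuit(2)\n"
    "    qc.h(0)\n"
    "    qc.cx(0, 1)\n"
    "    return qc\n"
)

# Functional-correctness check: run prompt + completion, then the tests.
ns = {}
exec(task["prompt"] + model_completion, ns)
exec(task["test"], ns)  # raises AssertionError if the solution is wrong
print("task passed")
```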
The availability of Large Language Models (LLMs) which can generate code has made it possible to create tools that improve developer productivity. Integrated development environments, or IDEs, which developers use to write software, are often used as an interface to interact with LLMs. Although many such tools have been released, almost all of them focus on general-purpose programming languages. Domain-specific languages, such as those crucial for IT automation, have not received much attention. Ansible is one such YAML-based...
The complexity and scale of modern software programs often lead to overlooked programming errors and security vulnerabilities. Developers rely on automatic tools, like static analysis tools, to look for bugs. Static analysis tools are widely used because they can understand nontrivial program behaviors, scale to millions of lines of code, and detect subtle bugs. However, they are known to generate an excess of false alarms which hinder their utilization, as it is counterproductive for developers to go through a long list of reported issues, only to find a few true...
The recent improvement in code generation capabilities due to the use of large language models has mainly benefited general purpose programming languages. Domain specific languages, such as the ones used for IT Automation, have received far less attention, despite involving many active developers and being an essential component of modern cloud platforms. This work focuses on Ansible-YAML, a widely used markup language for IT Automation. We present Ansible Wisdom, a natural-language to Ansible-YAML code generation tool, aimed at improving...
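Below is a minimal sketch of the kind of (prompt, completion) pair such a tool operates on, following the common Ansible convention of prompting with a task name; the sample is illustrative and not drawn from the Wisdom training data.

```python
# Illustrative natural-language-to-Ansible-YAML pair (not Wisdom data):
# the prompt is a task name, the completion is the module invocation.
prompt = "- name: Install the nginx package"
completion = (
    "  ansible.builtin.apt:\n"
    "    name: nginx\n"
    "    state: present\n"
)
print(prompt)
print(completion, end="")
```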
Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection. While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning. On one hand,...
Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying...
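One way such a round-trip check can be framed is sketched below; gen_spec and gen_code stand in for LLM calls, and the fixed entry-point name is a hypothetical convention, not the paper's interface.

```python
# Hedged sketch of a code -> specification -> code self-consistency check.
def self_consistent(code, gen_spec, gen_code, inputs, entry="solution"):
    spec = gen_spec(code)    # model writes a specification for its own code
    code2 = gen_code(spec)   # model re-implements its own specification
    ns1, ns2 = {}, {}
    exec(code, ns1)
    exec(code2, ns2)
    f1, f2 = ns1[entry], ns2[entry]
    # Agreement on sample inputs approximates preservation of semantics.
    return all(f1(x) == f2(x) for x in inputs)
```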
Large language models (LLMs) have become remarkably good at improving developer productivity for high-resource programming languages. These models use two kinds of data: large amounts of unlabeled code samples for pre-training and relatively smaller amounts of labeled code samples for fine-tuning or in-context learning. Unfortunately, many programming languages are low-resource, lacking labeled samples for most tasks and often even lacking unlabeled samples. Therefore, users of low-resource languages (e.g., legacy or new languages) miss out on the benefits of LLMs. Cross-lingual transfer uses data from a...
Understanding the functional (dis)-similarity of source code is significant for code modeling tasks such as software vulnerability and code clone detection. We present DISCO (DIS-similarity of COde), a novel self-supervised model focusing on identifying (dis)similar functionalities of source code. Different from existing works, our approach does not require a huge amount of randomly collected datasets. Rather, we design structure-guided code transformation algorithms to generate synthetic code clones and inject real-world security...
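A minimal sketch of structure-guided pair construction in this spirit (the two transformations are illustrative, not the paper's actual algorithms): an identifier rename yields a semantically equivalent clone for a positive pair, and a flipped operator injects a subtle bug for a hard negative.

```python
import re

original = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s"
)

# Positive example: rename an identifier; behavior is unchanged.
clone = re.sub(r"\bs\b", "acc", original)

# Hard negative: flip an operator to inject a subtle, realistic bug.
buggy = original.replace("s += x", "s -= x")

pairs = [(original, clone, 1), (original, buggy, 0)]  # 1 = similar, 0 = dissimilar
```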