NFDI4DS | UHH-SEMS - Publication Details

Nghi D. Q. Bui

ORCID: 0000-0003-1984-4329

About

Contact & Profiles

A5107222721

Research Areas

Software Engineering Research
Software Testing and Debugging Techniques
Topic Modeling
Advanced Malware Detection Techniques
Natural Language Processing Techniques
Software System Performance and Reliability
Software Reliability and Analysis Research
Web Data Mining and Analysis
Adversarial Robustness in Machine Learning
Computability, Logic, AI Algorithms
Model-Driven Software Engineering Techniques
Business Process Modeling and Analysis
Sentiment Analysis and Opinion Mining
Multi-Agent Systems and Negotiation
Software Engineering Techniques and Practices
Web Application Security Vulnerabilities
Machine Learning and Data Classification
Semantic Web and Ontologies
Explainable Artificial Intelligence (XAI)
Data Mining Algorithms and Applications
Anomaly Detection Techniques and Applications
Collaboration in agile enterprises
Text Readability and Simplification
Network Security and Intrusion Detection
Advanced Text Analysis Techniques

Fulbright University Vietnam
2023-2024

Singapore Management University
2017-2022

InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees

OPENALEX - Publications

Nghi D. Q. Bui Yijun Yu Lingxiao Jiang

Learning code representations has found many uses in software engineering, such as code classification, code search, comment generation, and bug prediction. Although representations of code in tokens, syntax trees, dependency graphs, paths in trees, or combinations of their variants have been proposed, existing learning techniques have a major limitation: these models are often trained on datasets labeled for specific downstream tasks, so the code representations may not be suitable for other tasks. Even though some techniques generate representations from unlabeled code, they are far from being...

10.1109/icse43902.2021.00109 article EN 2021-05-01

On the generalizability of Neural Program Models with respect to semantic-preserving program transformations

Md Rafiqul Islam Rabin Nghi D. Q. Bui Ke Wang Yijun Yu Lingxiao Jiang and 1 more

10.1016/j.infsof.2021.106552 article EN Information and Software Technology 2021-02-19

Bilateral Dependency Neural Networks for Cross-Language Algorithm Classification

Nghi D. Q. Bui Yijun Yu Lingxiao Jiang

Algorithm classification is to automatically identify the classes of a program based on the algorithm(s) and/or data structure(s) implemented in the program. It can be useful for various tasks, such as code reuse, code theft detection, and malware detection. Code similarity metrics, on the basis of features extracted from syntax and semantics, have been used to classify programs. Such features, however, often need manual selection effort and are specific to individual programming languages, limiting the classifiers to programs in the same...

10.1109/saner.2019.8667995 article EN 2019-02-01

TreeCaps: Tree-Based Capsule Networks for Source Code Processing

Nghi D. Q. Bui Yijun Yu Lingxiao Jiang

Recently, program learning techniques have been proposed to process source code based on syntactical structures (e.g., abstract syntax trees) and/or semantic information (e.g., dependency graphs). While graphs may be better than trees at capturing code semantics, constructing the graphs from code inputs through the semantic analysis of multiple viewpoints can lead to inaccurate noises for a specific software engineering task. Compared to graphs, syntax trees are more precisely defined on the grammar and easier to parse; unfortunately, previous tree-based...

10.1609/aaai.v35i1.16074 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2021-05-18

AutoFocus: Interpreting Attention-Based Neural Networks by Code Perturbation

Nghi D. Q. Bui Yijun Yu Lingxiao Jiang

Despite being adopted in software engineering tasks, deep neural networks are treated mostly as a black box due to the difficulty in interpreting how the networks infer the outputs from the inputs. To address this problem, we propose AutoFocus, an automated approach for rating and visualizing the importance of input elements based on their effects on the outputs of the networks. The approach is built on our hypotheses that (1) attention mechanisms incorporated into neural networks can generate discriminative scores for various input elements and (2) the discriminative scores reflect the effects of input elements on the outputs of the networks. This paper verifies the hypotheses by applying...

10.1109/ase.2019.00014 article EN 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) 2019-11-01

SAR: learning cross-language API mappings with little knowledge

Nghi D. Q. Bui Yijun Yu Lingxiao Jiang

To save effort, developers often translate programs from one programming language to another, instead of implementing them from scratch. Translating application program interfaces (APIs) used in one language to functionally equivalent ones available in another language is an important aspect of program translation. Existing approaches facilitate the translation by automatically identifying the API mappings across programming languages. However, these approaches still require a large amount of parallel corpora, ranging from pairs of APIs or code fragments that are functionally equivalent,...

10.1145/3338906.3338924 article EN 2019-08-09

Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations

Nghi D. Q. Bui Yijun Yu Lingxiao Jiang

We propose Corder, a self-supervised contrastive learning framework for source code models. Corder is designed to alleviate the need for labeled data in code retrieval and code summarization tasks. The pre-trained model of Corder can be used in two ways: (1) it can produce vector representations of code, which can be applied to code retrieval tasks that do not have labeled data; (2) it can be used in a fine-tuning process for tasks that might still require labeled data, such as code summarization. The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets through a contrastive learning objective. To do so, we use a set...

10.1145/3404835.3462840 article EN Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021-07-11

CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Yue Wang Hung Lê Akhilesh Gotmare Nghi D. Q. Bui Junnan Li and 1 more

Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications, while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset...

10.48550/arxiv.2305.07922 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks

Nghi D. Q. Bui Lingxiao Jiang Yijun Yu

Towards the vision of translating code that implements an algorithm from one programming language into another, this paper proposes an approach for automated program classification using bilateral tree-based convolutional neural networks (BiTBCNNs). It is layered on top of two tree-based convolutional neural networks (TBCNNs), each of which recognizes the algorithm of code written in an individual programming language. The combination layer of the networks recognizes the similarities and differences among code in different programming languages. The BiTBCNNs are trained using source code in different languages but known to implement the same algorithms...

10.48550/arxiv.1710.06159 preprint EN other-oa arXiv (Cornell University) 2017-01-01

CodeTF: One-stop Transformer Library for State-of-the-art Code LLM

Nghi D. Q. Bui H Le Yue Wang Junnan Li Akhilesh Gotmare and 1 more

Code intelligence plays a key role in transforming modern software engineering. Recently, deep learning-based models, especially Transformer-based large language models (LLMs), have demonstrated remarkable potential in tackling these tasks by leveraging massive open-source code data and programming language features. However, the development and deployment of such models often require expertise in both machine learning and software engineering, creating a barrier for model adoption. In this paper, we present CodeTF, an open-source Transformer-based library...

10.48550/arxiv.2306.00029 preprint EN other-oa arXiv (Cornell University) 2023-01-01

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Dung Nguyen Manh Nam Le Hai Anh T. V. Dau Anh Minh Nguyen Khanh Nghiem and 2 more

Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui. Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023). 2023.

10.18653/v1/2023.nlposs-1.25 article EN cc-by 2023-01-01

RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code Completion

Huy N. Phan Hoang Ngo Phan Tien N. Nguyen Nghi D. Q. Bui

Code Large Language Models (CodeLLMs) have demonstrated impressive proficiency in code completion tasks. However, they often fall short of fully understanding the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies, which can result in less precise completions. To overcome these limitations, we present RepoHyper, a multifaceted framework designed to address the complex challenges associated with repository-level code completion. Central to RepoHyper is the Repo-level...

10.48550/arxiv.2403.06095 preprint EN arXiv (Cornell University) 2024-03-10

Envisioning the Next-Generation AI Coding Assistants: Insights & Proposals

Khanh Nghiem Anh Minh Nguyen Nghi D. Q. Bui

As a research-product hybrid group in AI for Software Engineering (AI4SE), we present four key takeaways from our experience developing in-IDE AI coding assistants. AI coding assistants should set clear expectations for usage, integrate with advanced IDE capabilities and existing extensions, use extendable backend designs, and collect app data responsibly for downstream analyses. We propose open questions and challenges that academia and industry should address to realize the vision of next-generation AI coding assistants.

10.48550/arxiv.2403.14592 preprint EN arXiv (Cornell University) 2024-03-21

TreeCaps: Tree-Structured Capsule Networks for Program Source Code Processing

Vinoj Jayasundara Nghi D. Q. Bui Lingxiao Jiang David Lo

Program comprehension is a fundamental task in software development and maintenance processes. Software developers often need to understand a large amount of existing code before they can develop new features or fix bugs in existing programs. Being able to process programming language code automatically and provide summaries of code functionality accurately can significantly help developers reduce time spent in code navigation and understanding, and thus increase productivity. Different from natural language articles, source code in programming languages often follows rigid syntactical...

10.48550/arxiv.1910.12306 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Bootstrapping Code-Text Pretrained Language Model to Detect Inconsistency Between Code and Comment

Anh T. V. Dau Nghi D. Q. Bui Jin Guo

Comments within source code are essential for developers to comprehend the code's purpose and ensure its correct usage. However, as codebases evolve, maintaining an accurate alignment between the comments and the code becomes increasingly challenging. Recognizing the growing interest in automated solutions for detecting and correcting differences between code and its accompanying comments, current methods rely primarily on heuristic rules. In contrast, this paper presents DocChecker, a tool powered by deep learning. DocChecker is adept at...

10.48550/arxiv.2306.06347 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Hierarchical learning of cross-language mappings through distributed vector representations for code

Nghi D. Q. Bui Lingxiao Jiang

Translating a program written in one programming language to another can be useful for software development tasks that need functionality implementations in different languages. Although past studies have considered this problem, they may be either specific to the language grammars, or specific to certain kinds of code elements (e.g., tokens, phrases, API uses). This paper proposes a new approach to automatically learn cross-language representations for various kinds of structural code elements that may be used for program translation. Our key idea is twofold: First, we...

10.1145/3183399.3183427 preprint EN 2018-05-27

AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology

Minh Huynh Nguyen T. P. Chau Phong X. Nguyen Nghi D. Q. Bui

Software agents have emerged as promising tools for addressing complex software engineering tasks. However, existing works oversimplify software development workflows by following the waterfall model. Thus, we propose AgileCoder, a multi-agent system that integrates Agile Methodology (AM) into the framework. This system assigns specific AM roles such as Product Manager, Developer, and Tester to different agents, who then collaboratively develop software based on user inputs. AgileCoder enhances development efficiency by organizing work...

10.48550/arxiv.2406.11912 preprint EN arXiv (Cornell University) 2024-06-16

REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark

Nam Le Hai Dung Nguyen Nghi D. Q. Bui

The ability of CodeLLMs to generate executable and functionally correct code at the repository-level scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository-level scale, emphasizing executability and correctness. RepoExec provides an automated system that verifies requirements and incorporates a mechanism for dynamically generating high-coverage test cases to assess the functionality of the generated code. Our work explores a controlled scenario...

10.48550/arxiv.2406.11927 preprint EN arXiv (Cornell University) 2024-06-17

Dopamin: Transformer-based Comment Classifiers through Domain Post-Training and Multi-level Layer Aggregation

Nam Le Hai Nghi D. Q. Bui

10.1145/3643787.3648044 article EN 2024-04-20

Dopamin: Transformer-based Comment Classifiers through Domain Post-Training and Multi-level Layer Aggregation

Nam Le Hai Nghi D. Q. Bui

Code comments provide important information for understanding the source code. They can help developers understand the overall purpose of a function or class, as well as identify bugs and technical debt. However, an overabundance of comments is meaningless and counterproductive. As a result, it is critical to automatically filter out these comments for specific purposes. In this paper, we present Dopamin, a Transformer-based tool for dealing with this issue. Our model excels not only in presenting knowledge sharing of common categories across...

10.48550/arxiv.2408.04663 preprint EN arXiv (Cornell University) 2024-08-06

XMainframe: A Large Language Model for Mainframe Modernization

Anh T. V. Dau Hieu Trung Dao Anh Tuan Nguyen Hieu Tran Phong X. Nguyen and 1 more

Mainframe operating systems, despite their inception in the 1940s, continue to support critical sectors like finance and government. However, these systems are often viewed as outdated, requiring extensive maintenance and modernization. Addressing this challenge necessitates innovative tools that can understand and interact with legacy codebases. To this end, we introduce XMainframe, a state-of-the-art large language model (LLM) specifically designed with knowledge of mainframe legacy systems and COBOL codebases. Our solution involves...

10.48550/arxiv.2408.04660 preprint EN arXiv (Cornell University) 2024-08-05

HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

Huy N. Phan Phong X. Nguyen Nghi D. Q. Bui

Large Language Models (LLMs) have revolutionized software engineering (SE), demonstrating remarkable capabilities in various coding tasks. While recent efforts have produced autonomous software agents based on LLMs for end-to-end development tasks, these systems are typically designed for specific SE tasks. We introduce HyperAgent, a novel generalist multi-agent system designed to address a wide spectrum of SE tasks across different programming languages by mimicking human developers' workflows. Comprising four specialized agents -...

10.48550/arxiv.2409.16299 preprint EN arXiv (Cornell University) 2024-09-09
