- Software Engineering Research
- Software Testing and Debugging Techniques
- Topic Modeling
- Advanced Malware Detection Techniques
- Natural Language Processing Techniques
- Software System Performance and Reliability
- Software Reliability and Analysis Research
- Web Data Mining and Analysis
- Adversarial Robustness in Machine Learning
- Computability, Logic, AI Algorithms
- Model-Driven Software Engineering Techniques
- Business Process Modeling and Analysis
- Sentiment Analysis and Opinion Mining
- Multi-Agent Systems and Negotiation
- Software Engineering Techniques and Practices
- Web Application Security Vulnerabilities
- Machine Learning and Data Classification
- Semantic Web and Ontologies
- Explainable Artificial Intelligence (XAI)
- Data Mining Algorithms and Applications
- Anomaly Detection Techniques and Applications
- Collaboration in Agile Enterprises
- Text Readability and Simplification
- Network Security and Intrusion Detection
- Advanced Text Analysis Techniques
Fulbright University Vietnam
2023-2024
Singapore Management University
2017-2022
Learning code representations has found many uses in software engineering, such as code classification, code search, comment generation, and bug prediction. Although representations of code in tokens, syntax trees, dependency graphs, paths, or combinations of their variants have been proposed, existing learning techniques have a major limitation: these models are often trained on datasets labeled for specific downstream tasks, so the representations may not be suitable for other tasks. Even though some techniques generate representations from unlabeled code, they are far from being...
Algorithm classification is to automatically identify the classes of a program based on the algorithm(s) and/or data structure(s) implemented in the program. It can be useful for various tasks, such as code reuse, code theft detection, and malware detection. Code similarity metrics, on the basis of features extracted from syntax and semantics, have been used to classify programs. Such features, however, often need manual selection effort and are specific to individual programming languages, limiting the classifiers to programs in the same...
Recently, program learning techniques have been proposed to process source code based on syntactical structures (e.g., abstract syntax trees) and/or semantic information (e.g., dependency graphs). While graphs may be better than trees at capturing semantics, constructing the graphs from inputs through analysis of multiple viewpoints can lead to inaccurate noise for a specific software engineering task. Compared to graphs, trees are more precisely defined by the grammar and easier to parse; unfortunately, previous tree-based...
Despite being adopted in software engineering tasks, deep neural networks are treated mostly as a black box due to the difficulty of interpreting how the networks infer outputs from inputs. To address this problem, we propose AutoFocus, an automated approach for rating and visualizing the importance of input elements based on their effects on the outputs of the networks. The approach is built on our hypotheses that (1) attention mechanisms incorporated into neural networks can generate discriminative scores for various input elements and (2) the scores reflect those effects. This paper verifies the hypotheses by applying...
To save effort, developers often translate programs from one programming language to another, instead of implementing them from scratch. Translating the application program interfaces (APIs) used in one language to functionally equivalent ones available in another language is an important aspect of program translation. Existing approaches facilitate the translation by automatically identifying API mappings across programming languages. However, these approaches still require a large amount of parallel corpora, ranging from pairs of APIs or code fragments that are functionally equivalent,...
We propose Corder, a self-supervised contrastive learning framework for source code models. Corder is designed to alleviate the need for labeled data in code retrieval and code summarization tasks. The pre-trained model can be used in two ways: (1) it can produce vector representations of code, which can be applied to code retrieval tasks that do not have labeled data; (2) it can be used in a fine-tuning process for tasks that might still require labeled data, such as code summarization. The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets through a contrastive learning objective. To do so, we use a set...
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications, while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset...
Towards the vision of translating code that implements an algorithm from one programming language into another, this paper proposes an approach for automated program classification using bilateral tree-based convolutional neural networks (BiTBCNNs). It is layered on top of two tree-based convolutional neural networks (TBCNNs), each of which recognizes the algorithm of code written in an individual programming language. The combination layer of the networks recognizes the similarities and differences among code in different programming languages. The BiTBCNNs are trained on source code in different languages that is known to implement the same algorithms...
Code intelligence plays a key role in transforming modern software engineering. Recently, deep learning-based models, especially Transformer-based large language models (LLMs), have demonstrated remarkable potential in tackling these tasks by leveraging massive open-source code data and programming language features. However, the development and deployment of such models often require expertise in both machine learning and software engineering, creating a barrier for model adoption. In this paper, we present CodeTF, an open-source Transformer-based library...
Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui. Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023). 2023.
Code Large Language Models (CodeLLMs) have demonstrated impressive proficiency in code completion tasks. However, they often fall short of fully understanding the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies, which can result in less precise completions. To overcome these limitations, we present RepoHyper, a multifaceted framework designed to address the complex challenges associated with repository-level code completion. Central to RepoHyper is the Repo-level...
As a research-product hybrid group in AI for Software Engineering (AI4SE), we present four key takeaways from our experience developing in-IDE coding assistants. Coding assistants should set clear expectations for usage, integrate with advanced IDE capabilities and existing extensions, use extendable backend designs, and collect app data responsibly for downstream analyses. We propose open questions and challenges that academia and industry should address to realize the vision of next-generation coding assistants.
Program comprehension is a fundamental task in software development and maintenance processes. Software developers often need to understand a large amount of existing code before they can develop new features or fix bugs in existing programs. Being able to process programming language code automatically and provide summaries of code functionality accurately can significantly help developers reduce time spent in code navigation and understanding, and thus increase productivity. Different from natural language articles, source code in programming languages often follows rigid syntactical...
Comments within source code are essential for developers to comprehend the code's purpose and ensure its correct usage. However, as codebases evolve, maintaining an accurate alignment between the comments and the code becomes increasingly challenging. Despite growing interest in automated solutions for detecting and correcting differences between code and its accompanying comments, current methods rely primarily on heuristic rules. In contrast, this paper presents DocChecker, a tool powered by deep learning. DocChecker is adept at...
Translating a program written in one programming language to another can be useful for software development tasks that need functionality implementations in different languages. Although past studies have considered this problem, they may be either specific to the language grammars or specific to certain kinds of code elements (e.g., tokens, phrases, API uses). This paper proposes a new approach to automatically learn cross-language representations for various kinds of structural code elements that may be used for program translation. Our key idea is twofold: First, we...
Software agents have emerged as promising tools for addressing complex software engineering tasks. However, existing works oversimplify software development workflows by following the waterfall model. Thus, we propose AgileCoder, a multi-agent system that integrates Agile Methodology (AM) into the framework. This system assigns specific AM roles such as Product Manager, Developer, and Tester to different agents, who then collaboratively develop software based on user inputs. AgileCoder enhances development efficiency by organizing work...
The ability of CodeLLMs to generate executable and functionally correct code at the repository-level scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository-level scale, emphasizing executability and correctness. RepoExec provides an automated system that verifies requirements and incorporates a mechanism for dynamically generating high-coverage test cases to assess the functionality of generated code. Our work explores a controlled scenario...
Code comments provide important information for understanding the source code. They can help developers understand the overall purpose of a function or class, as well as identify bugs and technical debt. However, an overabundance of comments is meaningless and counterproductive. As a result, it is critical to automatically filter out these comments for specific purposes. In this paper, we present Dopamin, a Transformer-based tool for dealing with this issue. Our model excels not only in presenting knowledge sharing of common categories across...
Mainframe operating systems, despite their inception in the 1940s, continue to support critical sectors like finance and government. However, these systems are often viewed as outdated, requiring extensive maintenance and modernization. Addressing this challenge necessitates innovative tools that can understand and interact with legacy codebases. To this end, we introduce XMainframe, a state-of-the-art large language model (LLM) specifically designed with knowledge of mainframe legacy systems and COBOL codebases. Our solution involves...
Large Language Models (LLMs) have revolutionized software engineering (SE), demonstrating remarkable capabilities in various coding tasks. While recent efforts have produced autonomous software agents based on LLMs for end-to-end development tasks, these systems are typically designed for specific SE tasks. We introduce HyperAgent, a novel generalist multi-agent system designed to address a wide spectrum of SE tasks across different programming languages by mimicking human developers' workflows. Comprising four specialized agents -...