Creating a Dataset for High-Performance Computing Code Translation using LLMs: A Bridge Between OpenMP Fortran and C++

DOI: 10.1109/hpec58863.2023.10363534 Publication Date: 2023-12-25T19:39:57Z
ABSTRACT
In this study, we present a novel dataset for training machine learning models to translate between OpenMP Fortran and C++ code. To ensure reliability and applicability, the dataset is created from a range of representative open-source benchmarks and further refined using a meticulous code similarity test. The effectiveness of our dataset is assessed with both quantitative (CodeBLEU) and qualitative (human evaluation) methods. We showcase how this dataset significantly elevates the translation competencies of large language models (LLMs). Specifically, models without prior coding knowledge experienced a 5.1x boost in their CodeBLEU scores, while models with some coding familiarity saw an impressive 9.9-fold increase. Our best fine-tuned model outperforms GPT-4, reaching human-level accuracy. This work underscores the immense potential of our dataset in propelling advancements in the domain of high-performance computing. The dataset is accessible at https://github.com/bin123apple/Fortran-CPP-HPC-code-translation-dataset.
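The abstract mentions refining the dataset with a code similarity test. The paper's exact metric is not reproduced on this page, so the sketch below is a hypothetical illustration only: a token-level Jaccard similarity used to drop near-duplicate Fortran/C++ pairs, one plausible ingredient of such a filter.

```python
# Hypothetical sketch of a code-similarity filter; the paper's actual
# "meticulous code similarity test" may differ substantially.
import re


def tokenize(code: str) -> set:
    """Split source code into a set of identifier and symbol tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", code))


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two code snippets."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_near_duplicates(pairs, threshold=0.9):
    """Keep only Fortran/C++ pairs whose Fortran side is sufficiently
    dissimilar from every pair already kept, reducing redundancy."""
    kept = []
    for fortran_src, cpp_src in pairs:
        if all(jaccard_similarity(fortran_src, prev) < threshold
               for prev, _ in kept):
            kept.append((fortran_src, cpp_src))
    return kept
```

A filter of this kind would be applied once over the assembled benchmark pairs before training, so that trivially repeated loops do not dominate the dataset.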