GENERATING SYNTHETIC TRAINING DATA FOR PROGRAMMING LANGUAGE TRANSLATION

GENERATING SYNTHETIC TRAINING DATA FOR PROGRAMMING LANGUAGE TRANSLATION - A simplified explanation of the abstract

This abstract first appeared for US patent application 17940618, titled 'GENERATING SYNTHETIC TRAINING DATA FOR PROGRAMMING LANGUAGE TRANSLATION'.

Simplified Explanation

The patent application describes techniques for generating synthetic paired source code snippets that are semantically equivalent but syntactically distinct.

Few-shot learning may be used to prompt a large language model with demonstration source code snippets written in syntactically constrained pseudocode, so that the model generates additional snippets in that pseudocode.
Given additional source code snippets in other programming languages, the large language model may also be used to generate more training snippets in the syntactically constrained pseudocode.
Training source code snippets in the syntactically constrained pseudocode are programmatically translated to create synthetic training pairs of semantically equivalent source code snippets.
Each synthetic training pair includes snippets in first and second programming languages, which can be used to train a machine learning translation model to translate between the languages.
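
The programmatic translation step can be illustrated with a minimal sketch. The application's summary does not define the pseudocode grammar or the target languages, so everything here is an assumption made for illustration: a hypothetical three-statement pseudocode (SET, ADD, PRINT), Python and JavaScript as the two languages, and the function names translate and make_training_pairs.

```python
import re

# Hypothetical constrained pseudocode (an assumption, not from the patent):
# one statement per line, limited to the three forms below.
NAME = r"[A-Za-z_]\w*"
SET = re.compile(rf"SET ({NAME}) TO (\w+)")
ADD = re.compile(rf"ADD (\w+) TO ({NAME})")
PRINT = re.compile(rf"PRINT ({NAME})")


def translate(pseudocode: str) -> tuple[str, str]:
    """Translate one pseudocode snippet into a (Python, JavaScript) pair."""
    py_lines, js_lines = [], []
    declared = set()  # JavaScript needs `let` only on first assignment
    for raw in pseudocode.strip().splitlines():
        line = raw.strip()
        if m := SET.fullmatch(line):
            name, value = m.groups()
            py_lines.append(f"{name} = {value}")
            keyword = "" if name in declared else "let "
            js_lines.append(f"{keyword}{name} = {value};")
            declared.add(name)
        elif m := ADD.fullmatch(line):
            value, name = m.groups()
            py_lines.append(f"{name} += {value}")
            js_lines.append(f"{name} += {value};")
        elif m := PRINT.fullmatch(line):
            (name,) = m.groups()
            py_lines.append(f"print({name})")
            js_lines.append(f"console.log({name});")
        else:
            # The rigid syntax is what makes rule-based translation feasible,
            # so anything outside the grammar is rejected rather than guessed.
            raise ValueError(f"not valid constrained pseudocode: {line!r}")
    return "\n".join(py_lines), "\n".join(js_lines)


def make_training_pairs(snippets: list[str]) -> list[tuple[str, str]]:
    """Build synthetic training pairs from a list of pseudocode snippets."""
    return [translate(snippet) for snippet in snippets]


pairs = make_training_pairs(["SET total TO 5\nADD 3 TO total\nPRINT total"])
```

Because both outputs are derived from the same pseudocode by fixed rules, each pair is semantically equivalent by construction while differing in syntax, which is the property a translation model would be trained on.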

Potential Applications

This technology could be applied in software development tools to automatically generate code snippets in multiple programming languages, aiding developers in writing cross-language compatible code.

Problems Solved

This technology helps address the challenge of translating code between different programming languages accurately and efficiently, saving time and effort for developers working on multi-language projects.

Benefits

The benefits of this technology include improved productivity for developers, increased code compatibility across languages, and enhanced accuracy in code translation tasks.

Potential Commercial Applications

One potential commercial application of this technology could be in the development of integrated development environments (IDEs) that offer automatic code translation features for multi-language projects.

Possible Prior Art

Machine learning models have previously been used for code translation tasks, which may constitute prior art, but the specific approach of generating synthetic paired source code snippets to train translation models may be novel.

Unanswered Questions

How does this technology handle complex code structures and logic during translation between programming languages?

The abstract does not specify how the translation model handles intricate code structures and logic when converting between programming languages.

What are the potential limitations or challenges faced by this technology in real-world applications?

The abstract does not address drawbacks or obstacles that may arise when this technology is used in practical software development environments.

Original Abstract Submitted

Techniques are described herein for generating synthetic paired source code snippets that are semantically equivalent but syntactically distinct. In various implementations, few shot learning may be performed to prompt a large language model, based on demonstration source code snippet(s) in syntactically constrained pseudocode, to generate additional source code snippets in the syntactically constrained pseudocode. Based on additional source code snippets in additional programming language(s), the large language model may be used to generate more training source code snippets in the syntactically constrained pseudocode. The training source code snippets in the syntactically constrained pseudocode may be programmatically translated to generate synthetic training pairs of semantically equivalent source code snippets. Each synthetic training pair of the plurality of synthetic training pairs may include training snippets in first and second programming languages, and may be usable to train a machine learning translation model to translate between the first and second programming languages.
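
The few-shot prompting step described in the abstract might be sketched as follows. The abstract does not give a prompt format or name a model, so the prompt layout, the function names build_few_shot_prompt and generate_snippets, and the complete callable (a stand-in for whatever large language model client is used) are all hypothetical.

```python
from typing import Callable, Sequence


def build_few_shot_prompt(demonstrations: Sequence[str], instruction: str) -> str:
    """Assemble a prompt that shows demonstration snippets and leaves the
    next example open for the model to fill in."""
    parts = [instruction.strip(), ""]
    for index, demo in enumerate(demonstrations, start=1):
        parts.append(f"### Example {index}")
        parts.append(demo.strip())
        parts.append("")
    # Open-ended final header: the model is expected to continue from here.
    parts.append(f"### Example {len(demonstrations) + 1}")
    return "\n".join(parts)


def generate_snippets(
    demonstrations: Sequence[str],
    complete: Callable[[str], str],
    count: int,
) -> list[str]:
    """Ask the model for `count` new pseudocode snippets.

    `complete` is a hypothetical callable that maps a prompt to generated
    text; any real model client would be wrapped to fit this signature.
    """
    prompt = build_few_shot_prompt(
        demonstrations,
        "Write a new snippet in the same constrained pseudocode.",
    )
    return [complete(prompt).strip() for _ in range(count)]


# Usage with a stub in place of a real model, for illustration only.
demos = ["SET x TO 1\nPRINT x", "SET y TO 2\nADD 3 TO y\nPRINT y"]
stub_model = lambda prompt: "SET z TO 4\nPRINT z\n"
new_snippets = generate_snippets(demos, stub_model, count=2)
```

Passing the model in as a callable keeps the sketch independent of any particular vendor API. In practice, generated snippets would likely need to be validated against the pseudocode grammar before being translated into training pairs.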