COMBINING DATA SELECTION AND REWARD FUNCTIONS FOR TUNING LARGE LANGUAGE MODELS USING REINFORCEMENT LEARNING

Abstract: A computer-implemented method, a computer program product, and a computer system for tuning large language models. A computer receives pairs of textual prompts and ground truth labels. A computer creates a data selection scoring function by repurposing one or more reward functions to compute similarity between the textual prompts and the ground truth labels, where the one or more reward functions measure similarity between textual outputs produced by a large language model and the ground truth labels. A computer selects a training dataset from the pairs of the textual prompts and the ground truth labels by using the data selection scoring function. A computer tunes the large language model using the training dataset and reinforcement learning with the one or more reward functions.
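
The data selection step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the application: `token_f1` is a hypothetical stand-in reward function, `make_selection_score` averages the reward functions, and `select_training_data` keeps the `k` highest-scoring pairs. The abstract does not say which reward functions are used, how multiple rewards are combined, or whether high-scoring or low-scoring pairs are preferred. The reinforcement learning step is omitted.

```python
from collections import Counter
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                  # (textual prompt, ground truth label)
RewardFn = Callable[[str, str], float]  # (text, label) -> similarity


def token_f1(text: str, label: str) -> float:
    """Hypothetical reward: token-overlap F1 between a text and a label."""
    text_tokens = text.lower().split()
    label_tokens = label.lower().split()
    if not text_tokens or not label_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(text_tokens) & Counter(label_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(text_tokens)
    recall = overlap / len(label_tokens)
    return 2 * precision * recall / (precision + recall)


def make_selection_score(reward_fns: List[RewardFn]) -> RewardFn:
    """Repurpose reward functions: apply them to (prompt, label) instead of
    (model output, label). Averaging is an assumption of this sketch."""
    def score(prompt: str, label: str) -> float:
        return sum(fn(prompt, label) for fn in reward_fns) / len(reward_fns)
    return score


def select_training_data(pairs: List[Pair], score: RewardFn, k: int) -> List[Pair]:
    """Keep the k highest-scoring pairs (the direction is an assumption)."""
    return sorted(pairs, key=lambda p: score(p[0], p[1]), reverse=True)[:k]


pairs = [
    ("the cat sat on the mat", "cat sat"),
    ("hello world", "goodbye"),
    ("sum two and two", "two and two"),
]
score = make_selection_score([token_f1])
selected = select_training_data(pairs, score, k=2)
```

In this sketch the same `token_f1` function would later serve as the reward during reinforcement learning, scoring model outputs against the labels in place of the prompts.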

Inventor(s): Long VU, Nhan Huu Pham, Dharmashankar Subramanian, Todd William Mummert

CPC Classification: G06F40/40 (ELECTRIC DIGITAL DATA PROCESSING (computer systems based on specific computational models))

Patent application number: 20250232129