LARGE LANGUAGE MODEL DRIVEN DATA AUGMENTATION FOR PROTEIN MACHINE LEARNING

Abstract: A method for training a machine learning model (MLM) to predict the activity of a protein is described herein. In an example, the method involves accessing a set of training data comprising labeled examples with known activity levels. A large language model is used to generate synthetic examples from each labeled example by incorporating each possible amino acid (AA) mutation at each AA position in the labeled example and predicting the probability that each AA mutation replaces the original AA. Based on a predetermined cutoff, a subset of negative synthetic examples that comprises at least one AA mutation with the lowest probability of being incorporated is selected. An augmented training dataset is generated, and an MLM is trained, using the training data and the augmented training dataset, by performing iterative operations to find a set of parameters that jointly minimize the sum of at least two loss functions.
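
The augmentation step described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the abstract: `toy_scorer` is a hypothetical stand-in for the large language model (it is not a real protein model), the "predetermined cutoff" is interpreted as a probability threshold, and the names `enumerate_single_mutants` and `select_negatives` are invented for illustration. The claimed method may differ in all of these respects.

```python
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Hypothetical scorer interface: (sequence, position) -> probability per amino acid.
PositionScorer = Callable[[str, int], Dict[str, float]]
# A mutant record: (mutated sequence, position, new amino acid, probability).
Mutant = Tuple[str, int, str, float]


def toy_scorer(sequence: str, position: int) -> Dict[str, float]:
    """Placeholder for a language model (assumption, not a real model).

    Favors the original residue and assigns the other residues an
    arbitrary, deterministic weight, then normalizes to sum to 1.
    """
    original = sequence[position]
    weights = {
        aa: 10.0 if aa == original else 1.0 / (1 + abs(ord(aa) - ord(original)))
        for aa in AMINO_ACIDS
    }
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def enumerate_single_mutants(sequence: str, scorer: PositionScorer) -> List[Mutant]:
    """Generate every single-residue substitution with its predicted probability."""
    mutants: List[Mutant] = []
    for pos, original in enumerate(sequence):
        probs = scorer(sequence, pos)
        for aa in AMINO_ACIDS:
            if aa == original:
                continue  # skip the unmutated residue
            mutated = sequence[:pos] + aa + sequence[pos + 1:]
            mutants.append((mutated, pos, aa, probs[aa]))
    return mutants


def select_negatives(mutants: List[Mutant], cutoff: float) -> List[Mutant]:
    """Keep mutants whose probability is below the cutoff, least likely first.

    Treating the cutoff as a probability threshold is an assumption.
    """
    return sorted((m for m in mutants if m[3] < cutoff), key=lambda m: m[3])
```

Under these assumptions, calling `select_negatives(enumerate_single_mutants("MKT", toy_scorer), cutoff=0.01)` would return the least probable substitutions, which could then be labeled as negative examples and appended to the original training data to form the augmented dataset.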

Inventor(s): Federico Vaggi

CPC Classification: G16B40/00 (ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding)

Search for rejections for patent application number 20250218545