Salesforce, Inc. (20250060944). AUTOMATED DATA EXTRACTION PIPELINE FOR LARGE LANGUAGE MODEL TRAINING
AUTOMATED DATA EXTRACTION PIPELINE FOR LARGE LANGUAGE MODEL TRAINING
Organization Name
Salesforce, Inc.
Inventor(s)
Shruthan Radhakrishna of San Francisco CA (US)
Hadi Minooei of San Francisco CA (US)
Yazdan Jamshidi of San Francisco CA (US)
This abstract first appeared for US patent application 20250060944 titled 'AUTOMATED DATA EXTRACTION PIPELINE FOR LARGE LANGUAGE MODEL TRAINING'.
Original Abstract Submitted
An automated data extraction pipeline for large language model (LLM) training may include extracting a set of code segments from a set of natural language question-answer (Q&A) combinations that each include a provided input, a provided output, and a provided code segment formatted to transform the provided input into the provided output. The data extraction pipeline may then generate a predicted output from a question portion of a first natural language Q&A combination using a first LLM. A first extracted code segment from the extracted set of code segments may then be executed to generate a first actual output of the first extracted code segment. One or more data samples may then be generated for training a second LLM based on a comparison of the first actual output to the predicted output. The second LLM may then be trained using the one or more data samples.
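
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and every name in it (QAPair, extract_code, run_segment, build_samples, toy_predictor) is hypothetical. It assumes that answers wrap code in <code> tags, that each code segment defines a function called transform, and that the first LLM can be stood in for by any callable mapping a question to a predicted output. The abstract does not say how the comparison shapes the data samples, so the sketch simply records whether the two outputs match. Training the second LLM is left out.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Assumption: answers wrap their code in <code>...</code> tags.
CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL)


@dataclass
class QAPair:
    question: str         # question portion of the Q&A combination
    answer: str           # answer portion, containing the code segment
    provided_input: str   # input supplied with the Q&A combination
    provided_output: str  # output supplied with the Q&A combination


def extract_code(qa: QAPair) -> Optional[str]:
    """Return the first code segment in the answer, or None if there is none."""
    match = CODE_TAG.search(qa.answer)
    return match.group(1).strip() if match else None


def run_segment(code: str, provided_input: str) -> str:
    """Execute a segment that defines transform(text) and return its output.

    exec() is NOT sandboxed; this is only acceptable for trusted toy data.
    """
    namespace: Dict[str, object] = {}
    exec(code, namespace)
    return str(namespace["transform"](provided_input))


def build_samples(qa_pairs: List[QAPair],
                  predict: Callable[[str], str]) -> List[dict]:
    """Pair each executed segment's actual output with the predicted output."""
    samples = []
    for qa in qa_pairs:
        code = extract_code(qa)
        if code is None:
            continue  # nothing to execute
        predicted = predict(qa.question)  # stand-in for the first LLM
        try:
            actual = run_segment(code, qa.provided_input)
        except Exception:
            continue  # skip segments that fail to run
        samples.append({
            "prompt": qa.question,
            "completion": code,
            "provided_output": qa.provided_output,
            "actual_output": actual,
            "predicted_output": predicted,
            # Assumed use of the comparison: a simple match label.
            "outputs_match": actual.strip() == predicted.strip(),
        })
    return samples


def toy_predictor(question: str) -> str:
    """Hypothetical stand-in for the first LLM; always predicts 'HELLO'."""
    return "HELLO"


pairs = [
    QAPair(
        question="How do I uppercase the string 'hello'?",
        answer="Use str.upper: <code>def transform(text):\n    return text.upper()</code>",
        provided_input="hello",
        provided_output="HELLO",
    )
]
samples = build_samples(pairs, toy_predictor)
```

In this toy run the executed segment and the stand-in predictor agree, so the single resulting sample carries a positive match label. A real pipeline would need sandboxed execution and an actual model call in place of toy_predictor.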