
18464422. Augmenting Tokenizer Training Data (PayPal, Inc.)

From WikiPatents

Augmenting Tokenizer Training Data

Organization Name

PayPal, Inc.

Inventor(s)

Ofek Levy of Tel Aviv (IL)


This abstract first appeared for US patent application 18464422, titled 'Augmenting Tokenizer Training Data'.

Original Abstract Submitted

Techniques are disclosed for altering tokenizer training data so that a tokenizer generates an improved library of tokens from the altered training data, for use in tokenizing new source data during training of a machine learning model (e.g., for natural language processing). A system retrieves, from a source database, a set of original text data that includes characters. The system identifies a plurality of strings included in the set of original text data. The system alters strings included in the original text data to generate altered text data by selecting current characters included in the strings to be altered. The system trains, using the altered text data, a tokenizer by assigning a threshold to the tokenizer and inputting the altered text data into the tokenizer, where the tokenizer generates tokens for strings in the altered text data based on the threshold. The system stores the trained tokenizer.
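
The flow described in the abstract (alter selected characters, then train a tokenizer under a threshold) can be illustrated with a minimal Python sketch. This is not the method claimed in the application. The abstract does not say what happens to the selected characters or what the threshold measures, so the sketch assumes that selected characters are replaced at random and that the threshold is a minimum adjacent-pair frequency in a toy byte-pair-encoding-style trainer. The names alter_strings, train_tokenizer, rate, and alphabet are hypothetical.

```python
import random
from collections import Counter


def alter_strings(strings, rate=0.3, alphabet="0123456789abcdef", seed=0):
    """Select characters in each string and replace them.

    Assumption: the abstract only says characters are "selected"; random
    replacement from `alphabet` is an illustrative choice.
    """
    rng = random.Random(seed)
    altered = []
    for s in strings:
        chars = list(s)
        for i in range(len(chars)):
            if rng.random() < rate:
                chars[i] = rng.choice(alphabet)
        altered.append("".join(chars))
    return altered


def train_tokenizer(texts, threshold):
    """Toy BPE-style trainer.

    Assumption: `threshold` is the minimum number of times an adjacent pair
    must occur before it is merged into a new token.
    """
    words = [list(t) for t in texts]
    vocab = {c for w in words for c in w}
    while True:
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < threshold:
            break
        merged = a + b
        vocab.add(merged)
        # Rewrite each word with the chosen pair merged into one token.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return vocab


if __name__ == "__main__":
    original = ["txn-4f9a2c", "txn-4f9a2d", "txn-4f9a2e"]
    altered = alter_strings(original, rate=0.3)
    print(sorted(train_tokenizer(altered, threshold=2), key=len))
```

Under these assumptions, a higher threshold yields a smaller token library, because fewer pairs occur often enough to be merged. Altering characters before training lowers the frequency of pairs that recur only in the original data.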
