Optimal Subword Tokenization and Vocabulary Creation

Abstract: Subword tokenization is provided. The method comprises receiving a text document comprising n bytes and specifying a maximum token width of l bytes. An initial vocabulary of tokens is defined, wherein the tokens comprise a number of different n-grams of l or fewer bytes. The document is tokenized with the fewest tokens from the vocabulary according to a minimum total weight path through a directed acyclic graph comprising nodes that represent intervals between the bytes in the document and edges that represent potential tokens from the vocabulary appearing in the text of the document. Natural language processing is then performed on the text document according to the tokenization.
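
The graph formulation in the abstract can be illustrated with a short Python sketch. It is not the claimed implementation. It assumes each vocabulary token carries a weight (unit weights here, so the minimum-weight path is also the path with the fewest tokens), treats byte boundaries 0..n as nodes, and treats each vocabulary match of at most l bytes as an edge. The function name tokenize_min_weight, the parameter names, and the sample vocabulary are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def tokenize_min_weight(
    data: bytes, vocab: Dict[bytes, float], max_width: int
) -> Optional[List[bytes]]:
    """Return a minimum-total-weight tokenization of data, or None if none exists.

    Nodes are the byte boundaries 0..n. An edge (start, end) exists when
    data[start:end] is in vocab and end - start <= max_width.
    """
    n = len(data)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: minimum weight of a path from node 0 to node i
    back = [0] * (n + 1)    # back[i]: start node of the last edge on that path
    best[0] = 0.0

    # Boundaries are already in topological order, so one left-to-right pass suffices.
    for end in range(1, n + 1):
        for start in range(max(0, end - max_width), end):
            if best[start] == inf:
                continue  # node start is unreachable
            weight = vocab.get(data[start:end])
            if weight is None:
                continue  # this byte span is not a vocabulary token
            if best[start] + weight < best[end]:
                best[end] = best[start] + weight
                back[end] = start

    if best[n] == inf:
        return None  # the vocabulary cannot cover the document

    # Walk the back-pointers from the final node to recover the tokens.
    tokens: List[bytes] = []
    i = n
    while i > 0:
        tokens.append(data[back[i]:i])
        i = back[i]
    tokens.reverse()
    return tokens


# Hypothetical vocabulary: every single byte of the text plus a few longer n-grams,
# all with unit weight so that minimum weight means fewest tokens.
sample_vocab = {tok: 1.0 for tok in (b"b", b"a", b"n", b"an", b"ana", b"ban")}
print(tokenize_min_weight(b"banana", sample_vocab, 3))  # [b'ban', b'ana']
```

Because every edge spans at most l bytes, this pass examines at most n × l candidate edges. Non-uniform weights could be substituted in the vocabulary dictionary without changing the search.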

Inventor(s): Craig William Schmidt

CPC Classification: G06F40/40 (ELECTRIC DIGITAL DATA PROCESSING (computer systems based on specific computational models))

Patent Application Number: 20250232132