20240054288. Inference Methods For Word Or Wordpiece Tokenization simplified abstract (GOOGLE LLC)


Inference Methods For Word Or Wordpiece Tokenization

Organization Name

GOOGLE LLC

Inventor(s)

Xinying Song of Bellevue WA (US)

Yang Song of Bellevue WA (US)

Inference Methods For Word Or Wordpiece Tokenization - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240054288 titled 'Inference Methods For Word Or Wordpiece Tokenization'.

Simplified Explanation

- Systems and methods for performing inference for word or wordpiece tokenization using a left-to-right longest-match-first greedy process are disclosed.
- The vocabulary may be organized into a trie structure in which each node includes a precomputed token or token_id and a fail link for efficient parsing.
- The tokenizer can parse the trie in a single pass to generate a list of only those tokens or token_ids that correspond to the longest matching vocabulary entries in the sample string, without backtracking.
- Nodes in the trie may have a fail link and a prev_match link to optimize tokenization and avoid redundant processing (the greedy matching these links accelerate is sketched after this list).
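
For reference, below is a minimal sketch of the baseline greedy left-to-right longest-match-first ("MaxMatch") wordpiece tokenization over a trie. It is a simplified illustration, not the patented single-pass method: the fail links and prev_match links described above are what would let the tokenizer avoid the re-scanning this baseline performs. The vocabulary, token_ids, and function names are illustrative assumptions, and continuation markers such as "##" are omitted for brevity.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child TrieNode
        self.token_id = None  # set if the path to this node spells a vocabulary entry


def build_trie(vocab):
    """Build a character trie from a dict of wordpiece string -> token_id."""
    root = TrieNode()
    for piece, token_id in vocab.items():
        node = root
        for ch in piece:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root


def max_match_tokenize(word, root, unk_id=0):
    """Greedy left-to-right longest-match-first tokenization of one word.

    This baseline re-scans from the end of each match; the fail-link trie
    described in the patent is intended to produce the same output in a
    single pass over the input.
    """
    token_ids, start = [], 0
    while start < len(word):
        node, match_end, match_id = root, None, None
        for i in range(start, len(word)):
            node = node.children.get(word[i])
            if node is None:
                break
            if node.token_id is not None:
                match_end, match_id = i + 1, node.token_id
        if match_id is None:      # no vocabulary entry matches at this position
            return [unk_id]
        token_ids.append(match_id)
        start = match_end         # continue after the longest match found
    return token_ids


# Illustrative vocabulary (token_ids are arbitrary):
vocab = {"un": 1, "aff": 2, "able": 3, "affable": 4}
root = build_trie(vocab)
print(max_match_tokenize("unaffable", root))  # -> [1, 4] ("un" + "affable")
```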

Potential Applications

This technology can be applied in natural language processing, machine translation, text analysis, and information retrieval systems.

Problems Solved

- Efficient word or wordpiece tokenization without the need for backtracking.
- Optimized parsing of the vocabulary for faster processing.

Benefits

- Improved performance in tokenization tasks.
- Reduced computational resources required for processing text data.
- Enhanced accuracy in identifying and extracting tokens from input text.


Original Abstract Submitted

Systems and methods for performing inference for word or wordpiece tokenization are disclosed using a left-to-right longest-match-first greedy process. In some examples, the vocabulary may be organized into a trie structure in which each node includes a precomputed token or token_id and a fail link, so that the tokenizer can parse the trie in a single pass to generate a list of only those tokens or token_ids that correspond to the longest matching vocabulary entries in the sample string, without the need for backtracking. In some examples, the vocabulary may be organized into a trie in which each node has a fail link, and any node that would share token(s) or token_id(s) of a preceding node is instead given a prev_match link that points back to a chain of nodes with those token(s) or token_id(s).
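
The second trie organization described in the abstract, in which repeated token lists are replaced by prev_match links, could be represented with node records along the following lines. This is a hypothetical sketch of the data layout only; the field names and any traversal logic that would use them are assumptions, since the abstract does not spell them out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LinkedTrieNode:
    # Children keyed by the next character of a vocabulary entry.
    children: Dict[str, "LinkedTrieNode"] = field(default_factory=dict)
    # token_id of the vocabulary entry ending at this node, if any.
    token_id: Optional[int] = None
    # Per the abstract: precomputed node to continue matching from when the
    # next input character has no child here, so no backtracking is needed.
    fail_link: Optional["LinkedTrieNode"] = None
    # Per the abstract: rather than duplicating token_id(s) already stored on
    # a preceding node, point back to the chain of nodes that hold them.
    prev_match: Optional["LinkedTrieNode"] = None
```

A tokenizer using such a layout would, on a failed character lookup, collect token_ids by following the prev_match chain and resume matching from the fail_link target; that is one way to realize the single-pass, no-backtracking behavior the abstract describes.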