20240054288. Inference Methods For Word Or Wordpiece Tokenization simplified abstract (GOOGLE LLC)


Inference Methods For Word Or Wordpiece Tokenization

Organization Name

GOOGLE LLC

Inventor(s)

Xinying Song of Bellevue WA (US)

Yang Song of Bellevue WA (US)

Inference Methods For Word Or Wordpiece Tokenization - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240054288 titled 'Inference Methods For Word Or Wordpiece Tokenization'.

Simplified Explanation

- Systems and methods for performing inference for word or wordpiece tokenization using a left-to-right longest-match-first greedy process are disclosed.
- The vocabulary may be organized into a trie structure in which each node includes a precomputed token or token_id and a fail link for efficient parsing.
- The tokenizer can parse the trie in a single pass to generate a list of only those tokens or token_ids that correspond to the longest matching vocabulary entries in the sample string, without backtracking.
- Nodes in the trie may have a fail link and a prev_match link to optimize tokenization and avoid redundant processing (the greedy matching these links accelerate is sketched after this list).
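
For reference, below is a minimal sketch of the baseline greedy left-to-right longest-match-first ("MaxMatch") wordpiece tokenization over a trie. It is a simplified illustration, not the patented single-pass method: the fail links and prev_match links described above are what would let the tokenizer avoid the re-scanning this baseline performs. The vocabulary, token_ids, and function names are illustrative assumptions, and continuation markers such as "##" are omitted for brevity.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child TrieNode
        self.token_id = None  # set if the path to this node spells a vocabulary entry


def build_trie(vocab):
    """Build a character trie from a dict of wordpiece string -> token_id."""
    root = TrieNode()
    for piece, token_id in vocab.items():
        node = root
        for ch in piece:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root


def max_match_tokenize(word, root, unk_id=0):
    """Greedy left-to-right longest-match-first tokenization of one word.

    This baseline re-scans from the end of each match; the fail-link trie
    described in the patent is intended to produce the same output in a
    single pass over the input.
    """
    token_ids, start = [], 0
    while start < len(word):
        node, match_end, match_id = root, None, None
        for i in range(start, len(word)):
            node = node.children.get(word[i])
            if node is None:
                break
            if node.token_id is not None:
                match_end, match_id = i + 1, node.token_id
        if match_id is None:      # no vocabulary entry matches at this position
            return [unk_id]
        token_ids.append(match_id)
        start = match_end         # continue after the longest match found
    return token_ids


# Illustrative vocabulary (token_ids are arbitrary):
vocab = {"un": 1, "aff": 2, "able": 3, "affable": 4}
root = build_trie(vocab)
print(max_match_tokenize("unaffable", root))  # -> [1, 4] ("un" + "affable")
```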

Potential Applications

This technology can be applied in natural language processing, machine translation, text analysis, and information retrieval systems.

Problems Solved

- Efficient word or wordpiece tokenization without the need for backtracking.
- Optimized parsing of the vocabulary for faster processing.

Benefits

- Improved performance in tokenization tasks.
- Reduced computational resources required for processing text data.
- Enhanced accuracy in identifying and extracting tokens from input text.


Original Abstract Submitted

Systems and methods for performing inference for word or wordpiece tokenization are disclosed using a left-to-right longest-match-first greedy process. In some examples, the vocabulary may be organized into a trie structure in which each node includes a precomputed token or token_id and a fail link, so that the tokenizer can parse the trie in a single pass to generate a list of only those tokens or token_ids that correspond to the longest matching vocabulary entries in the sample string, without the need for backtracking. In some examples, the vocabulary may be organized into a trie in which each node has a fail link, and any node that would share token(s) or token_id(s) of a preceding node is instead given a prev_match link that points back to a chain of nodes with those token(s) or token_id(s).
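
The second trie organization described in the abstract, in which repeated token lists are replaced by prev_match links, could be represented with node records along the following lines. This is a hypothetical sketch of the data layout only; the field names and any traversal logic that would use them are assumptions, since the abstract does not spell them out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LinkedTrieNode:
    # Children keyed by the next character of a vocabulary entry.
    children: Dict[str, "LinkedTrieNode"] = field(default_factory=dict)
    # token_id of the vocabulary entry ending at this node, if any.
    token_id: Optional[int] = None
    # Per the abstract: precomputed node to continue matching from when the
    # next input character has no child here, so no backtracking is needed.
    fail_link: Optional["LinkedTrieNode"] = None
    # Per the abstract: rather than duplicating token_id(s) already stored on
    # a preceding node, point back to the chain of nodes that hold them.
    prev_match: Optional["LinkedTrieNode"] = None
```

A tokenizer using such a layout would, on a failed character lookup, collect token_ids by following the prev_match chain and resume matching from the fail_link target; that is one way to realize the single-pass, no-backtracking behavior the abstract describes.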