18205609. Inference Methods For Word Or Wordpiece Tokenization simplified abstract (GOOGLE LLC)


Inference Methods For Word Or Wordpiece Tokenization

Organization Name

GOOGLE LLC

Inventor(s)

Xinying Song of Bellevue WA (US)

Yang Song of Bellevue WA (US)

Inference Methods For Word Or Wordpiece Tokenization - A simplified explanation of the abstract

This abstract first appeared for US patent application 18205609 titled 'Inference Methods For Word Or Wordpiece Tokenization'.

Simplified Explanation

Explanation

The patent application describes systems and methods for word or wordpiece tokenization using a left-to-right longest-match-first greedy process. The vocabulary is organized into a trie structure with nodes containing precomputed tokens or token IDs and fail links for efficient parsing.

  • The vocabulary is organized into a trie structure.
  • Each node in the trie contains a precomputed token or token ID and a fail link.
  • The tokenizer traverses the trie in a single pass over the sample string, producing only the tokens or token IDs that correspond to the longest matching vocabulary entries, without backtracking.
  • In an alternative organization, a node that would share the tokens or token IDs of a preceding node is instead given a prev_match link that points back to the chain of nodes holding those tokens or token IDs (a minimal sketch of the greedy matching process these structures speed up follows this list).
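
To make the greedy process concrete, here is a minimal sketch of baseline left-to-right longest-match-first (MaxMatch) wordpiece tokenization over a plain trie. The toy vocabulary, the token IDs, and the "##" continuation prefix are illustrative assumptions, and the sketch deliberately omits the precomputed tokens and fail links claimed in the application; it only shows the matching behavior those structures are designed to accelerate.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.token_id = None  # set when the path from the root spells a vocabulary entry


def build_trie(vocab):
    """Insert every vocabulary entry into a trie, storing its token ID at its final node."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root


def wordpiece_tokenize(word, root, unk_id=0):
    """Greedy left-to-right longest-match-first tokenization of a single word."""
    ids, start = [], 0
    while start < len(word):
        # Pieces that do not start the word are looked up with the "##" continuation prefix.
        node = root
        for ch in ("##" if start > 0 else ""):
            if ch not in node.children:
                return [unk_id]
            node = node.children[ch]
        last_match, pos = None, start
        while pos < len(word) and word[pos] in node.children:
            node = node.children[word[pos]]
            pos += 1
            if node.token_id is not None:
                last_match = (node.token_id, pos)  # longest vocabulary match so far
        if last_match is None:
            return [unk_id]  # nothing in the vocabulary matches here: unknown word
        token_id, start = last_match  # resume just after the longest match
        ids.append(token_id)
    return ids


# Illustrative vocabulary and usage; the entries and IDs are arbitrary.
vocab = {"un": 1, "unable": 6, "##aff": 2, "##able": 3, "##ord": 4, "##ordable": 5}
root = build_trie(vocab)
print(wordpiece_tokenize("unable", root))        # -> [6]
print(wordpiece_tokenize("unaffordable", root))  # -> [1, 2, 5]
```

Note the restart in the inner loop: when the walk overruns the longest match (for "unaffordable" it reaches "una" along the "unable" path before falling back to "un"), parsing resumes from the end of that match. The precomputed tokens and fail links described above let the tokenizer emit the right tokens and continue forward in a single pass instead of restarting.
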
Potential Applications
  • Natural language processing
  • Machine translation
  • Text analysis and classification

Problems Solved
  • Efficient word or wordpiece tokenization
  • Avoiding the need for backtracking during tokenization
  • Improving the speed and accuracy of text processing tasks

Benefits
  • Faster tokenization process
  • Reduced computational resources required
  • Improved accuracy in identifying tokens or token IDs in text data


Original Abstract Submitted

Systems and methods for performing inference for word or wordpiece tokenization are disclosed using a left-to-right longest-match-first greedy process. In some examples, the vocabulary may be organized into a trie structure in which each node includes a precomputed token or token_ID and a fail link, so that the tokenizer can parse the trie in a single pass to generate a list of only those tokens or token_IDs that correspond to the longest matching vocabulary entries in the sample string, without the need for backtracking. In some examples, the vocabulary may be organized into a trie in which each node has a fail link, and any node that would share token(s) or token_ID(s) of a preceding node is instead given a prev_match link that points back to a chain of nodes with those token(s) or token_ID(s).
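
For illustration only, the two vocabulary organizations described in the abstract can be pictured as trie node layouts like the following. The field names (token_ids, fail, prev_match) are hypothetical and not taken from the patent; how the links are precomputed and used is described in the application itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FailLinkNode:
    """First organization: each node carries precomputed token IDs and a fail link."""
    children: Dict[str, "FailLinkNode"] = field(default_factory=dict)
    token_ids: List[int] = field(default_factory=list)  # tokens emitted when matching leaves this node
    fail: Optional["FailLinkNode"] = None                # node where parsing resumes, avoiding backtracking


@dataclass
class PrevMatchNode:
    """Second organization: shared token IDs are reached through a prev_match chain."""
    children: Dict[str, "PrevMatchNode"] = field(default_factory=dict)
    token_ids: List[int] = field(default_factory=list)
    fail: Optional["PrevMatchNode"] = None
    prev_match: Optional["PrevMatchNode"] = None         # points back to the chain of nodes
                                                         # that hold the shared token IDs
```
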