17836977. Techniques for Pretraining Document Language Models for Example-Based Document Classification simplified abstract (Microsoft Technology Licensing, LLC)

From WikiPatents

Techniques for Pretraining Document Language Models for Example-Based Document Classification

Organization Name

Microsoft Technology Licensing, LLC

Inventor(s)

Guoxin Wang of Bellevue WA (US)

Dinei Afonso Ferreira Florencio of Redmond WA (US)

Wenfeng Cheng of Bellevue WA (US)

Techniques for Pretraining Document Language Models for Example-Based Document Classification - A simplified explanation of the abstract

This abstract first appeared for US patent application 17836977, titled 'Techniques for Pretraining Document Language Models for Example-Based Document Classification'.

Simplified Explanation

The patent application describes a data processing system that trains machine learning models to analyze unlabeled documents. The system receives a set of unlabeled documents associated with specific categories and fine-tunes two machine learning models.

  • The system receives unlabeled documents and categories to train machine learning models.
  • Two machine learning models are fine-tuned based on the unlabeled documents.
  • The first model determines a semantic representation of the categories.
  • The second model classifies the semantic representations according to the categories.
  • The models were pretrained on unlabeled data from a separate set of categories that does not include the target categories.
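The flow above can be sketched with toy stand-ins. The abstract does not specify the encoder or classifier architectures; `encode`, `build_prototypes`, and `classify` below are illustrative names, and a hashed bag-of-words stands in for a pretrained document language model:

```python
import numpy as np

DIM = 64  # size of the toy semantic representation

def encode(text: str) -> np.ndarray:
    """Model 1 stand-in: hashed bag-of-words vector, L2-normalized.
    In the patent's setup this would be a pretrained document language model."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        vec[hash(token) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def build_prototypes(examples_by_category: dict[str, list[str]]) -> dict[str, np.ndarray]:
    """Build one prototype per category by averaging the encodings of
    that category's example documents (no labels beyond the grouping)."""
    return {cat: np.mean([encode(d) for d in docs], axis=0)
            for cat, docs in examples_by_category.items()}

def classify(text: str, prototypes: dict[str, np.ndarray]) -> str:
    """Model 2 stand-in: assign the category whose prototype is most
    similar to the document's representation (dot product)."""
    vec = encode(text)
    return max(prototypes, key=lambda cat: float(vec @ prototypes[cat]))
```

With a handful of example documents per category, `classify("payment due on this invoice", prototypes)` would pick whichever category's examples share the most vocabulary with the query, which is the example-based behavior the bullets describe.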

Potential Applications

This technology has potential applications in various fields, including:

  • Document classification: The system can be used to automatically categorize and organize large volumes of documents based on their content.
  • Information retrieval: The trained models can help improve search engines by accurately understanding and categorizing documents for better search results.
  • Content recommendation: By analyzing the semantic representations of documents, the system can provide personalized content recommendations to users based on their interests.
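For the retrieval and recommendation uses above, a common approach (not described in the abstract itself) is to rank documents by cosine similarity between their semantic representations and a query's representation. A minimal sketch with stand-in vectors:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two representation vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank(query_vec: np.ndarray, doc_vecs: dict[str, np.ndarray]) -> list[str]:
    """Return document ids ordered from most to least similar to the query."""
    return sorted(doc_vecs,
                  key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
                  reverse=True)
```

In a real system the vectors would come from the trained encoder; here they are placeholders, but the ranking logic is the same for search results and for recommending content similar to what a user has read.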

Problems Solved

The technology addresses several problems in machine learning and document analysis:

  • Unlabeled document analysis: The system enables the training of machine learning models using unlabeled documents, allowing for more efficient and scalable analysis.
  • Semantic representation: The first model learns to determine a semantic representation of document categories, which enhances the understanding and classification of documents.
  • Multi-category training: By training the models with unlabeled data from different categories, the system can handle a wide range of document types and improve classification accuracy.
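One way to picture the multi-category point above, assuming a prototype-style classifier (the abstract does not spell out this mechanism): because classification compares representations to per-category prototypes rather than relying on a fixed output layer, a category unseen during pretraining can be added by averaging the representations of a few example documents.

```python
import numpy as np

def add_category(prototypes: dict, name: str, example_vecs: list) -> None:
    """Register a new category as the mean of its examples' representations."""
    prototypes[name] = np.mean(example_vecs, axis=0)

def nearest(prototypes: dict, vec: np.ndarray) -> str:
    """Pick the category whose prototype best matches the vector."""
    return max(prototypes, key=lambda cat: float(vec @ prototypes[cat]))

# Toy 3-d representations: pretraining covered "news" and "legal";
# "medical" was never seen and is added purely from examples.
prototypes = {"news": np.array([1.0, 0.0, 0.0]),
              "legal": np.array([0.0, 1.0, 0.0])}
add_category(prototypes, "medical", [np.array([0.0, 0.1, 0.9]),
                                     np.array([0.0, 0.0, 1.0])])
print(nearest(prototypes, np.array([0.1, 0.1, 0.95])))  # → medical
```

The vectors here are hand-picked placeholders; the point is only that extending to new document types requires example representations, not retraining from scratch.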

Benefits

The technology offers several benefits:

  • Improved document analysis: The fine-tuned machine learning models can accurately analyze and classify unlabeled documents, leading to more efficient and effective document processing.
  • Scalability: By training the models with unlabeled data, the system can handle large volumes of documents without the need for manual labeling.
  • Flexibility: The system can be adapted to different domains and document types by training the models with relevant unlabeled data.


Original Abstract Submitted

A data processing system implements a method for training machine learning modes, including receiving a set of one or more unlabeled documents associated one or more first categories of documents to be used to train machine learning models to analyze the one or more unlabeled documents, and fine-tuning a first machine learning model and a second machine learning model based on the one or more unlabeled document to enable the first machine learning model to determine a semantic representation of the one or more first categories of document, and to enable the second machine learning model to classify the semantic representations according to the one or more first categories of documents, the first machine learning model and the second machine learning model having been trained using first unlabeled training data including a second plurality of categories of documents that do not include the one or more first categories of documents.