AUTOMATED CATEGORIZATION AND PROCESSING OF DOCUMENT IMAGES OF VARYING DEGREES OF QUALITY

Organization Name

Inventor(s)

AUTOMATED CATEGORIZATION AND PROCESSING OF DOCUMENT IMAGES OF VARYING DEGREES OF QUALITY - A simplified explanation of the abstract

This abstract first appeared for US patent application 18514772, titled 'AUTOMATED CATEGORIZATION AND PROCESSING OF DOCUMENT IMAGES OF VARYING DEGREES OF QUALITY'.

Simplified Explanation

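The patent application describes an apparatus that uses a machine learning algorithm to classify text extracted from images of pages. The processor converts a page image into text, splits the text into whitespace-delimited tokens, and checks the tokens against a dictionary: tokens that match dictionary words are valid, and the rest are invalid. It then calculates a score from the ratio of valid tokens to total tokens. Only if the score exceeds a threshold does the processor classify the text into a category and store the image and/or text accordingly.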

Memory and processor in apparatus
Dictionary and machine learning algorithm stored in memory
Image of a page converted into text by processor
Identification of tokens within the text
Identification of invalid tokens by removing tokens that match dictionary words
Calculation of score based on ratio of valid tokens to total tokens
Classification of text into a category if score is above threshold
Storage of image and/or text in a database according to category
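
The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation from the application: the function names, the 0.7 threshold, the lowercasing and punctuation stripping, the in-memory dictionary standing in for a database, and the `classify` callable standing in for the trained machine learning algorithm are all hypothetical. The OCR step is omitted, so the input is already-extracted text.

```python
import string
from typing import Callable, Dict, List, Optional, Set


def tokenize(text: str) -> List[str]:
    # Split on runs of whitespace. Assumption: tokens at the very start
    # or end of the text are counted even without whitespace on both sides.
    return text.split()


def validity_score(text: str, dictionary: Set[str]) -> float:
    """Ratio of dictionary-matching (valid) tokens to all tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    # Assumption: matching ignores case and surrounding punctuation.
    valid = [t for t in tokens
             if t.strip(string.punctuation).lower() in dictionary]
    return len(valid) / len(tokens)


def process_page(
    text: str,
    dictionary: Set[str],
    classify: Callable[[str], str],   # hypothetical stand-in for the trained model
    database: Dict[str, List[str]],   # hypothetical stand-in for the database
    threshold: float = 0.7,           # illustrative value, not from the application
) -> Optional[str]:
    # Classify and store only when the score is strictly above the threshold.
    if validity_score(text, dictionary) <= threshold:
        return None
    category = classify(text)
    database.setdefault(category, []).append(text)
    return category


if __name__ == "__main__":
    words = {"invoice", "total", "amount", "due"}
    store: Dict[str, List[str]] = {}
    print(process_page("Invoice total amount due", words, lambda t: "invoice", store))
    print(process_page("Inv0ice t0tal amount due", words, lambda t: "invoice", store))
```

In this sketch, text that falls at or below the threshold is simply skipped; the abstract does not say what happens to such pages.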

Potential Applications

This technology can be applied in document processing, automated data entry, and content categorization tasks.

Problems Solved

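This technology addresses the problem of classifying text extracted from page images whose quality varies. The dictionary-based score estimates how much of the extracted text consists of recognizable words, so that only text scoring above the threshold is classified and stored in a database by category.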

Benefits

The benefits of this technology include improved accuracy in text classification, automated data processing, and enhanced organization of information.

Potential Commercial Applications

The technology can be used in document management systems, data entry software, and content management platforms to streamline processes and improve efficiency.

Possible Prior Art

Possible prior art includes optical character recognition (OCR) software that converts images of text into editable text files.

Unanswered Questions

How does the machine learning algorithm improve the accuracy of text classification?

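The abstract states only that the machine learning algorithm is trained to classify text; it does not describe the model or its training data. In general, such an algorithm learns patterns and features from labeled examples and uses them to distinguish between categories. The abstract does indicate that the algorithm is applied only when the score exceeds the threshold, so it is not run on text made up largely of invalid tokens.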

What types of categories can the text be classified into?

The categories depend on the labels in the training data provided to the machine learning algorithm, which the abstract does not list. They could range from specific topics or themes to broader groupings such as document type.

Original Abstract Submitted

An apparatus includes a memory and a processor. The memory stores a dictionary and a machine learning algorithm trained to classify text. The processor receives an image of a page, converts the image into a set of text, and identifies a plurality of tokens within the text. Each token includes one or more contiguous characters that are both preceded and followed by whitespace within the text. The processor identifies invalid tokens by removing tokens of the plurality of tokens that correspond to words of the dictionary. The processor calculates, based on a ratio of a total number of valid tokens to a total number of tokens, a score. In response to determining that the score is greater than a threshold, the processor applies the machine learning algorithm to classify the text into a category and stores the image and/or text in a database according to the category.
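
The score described in the abstract can be written compactly. In this sketch, the symbols N_valid (number of tokens matching dictionary words), N_total (number of all tokens), and τ (the threshold) are notation introduced here for illustration; the numbers in the worked example are assumed, not taken from the application.

```latex
s = \frac{N_{\text{valid}}}{N_{\text{total}}}, \qquad \text{classify the text only if } s > \tau

% Worked example with assumed values:
% 10 tokens, 8 found in the dictionary, \tau = 0.7
s = \frac{8}{10} = 0.8 > 0.7 \quad \Rightarrow \quad \text{text is classified}

% 10 tokens, 4 found in the dictionary, \tau = 0.7
s = \frac{4}{10} = 0.4 \le 0.7 \quad \Rightarrow \quad \text{text is not classified}
```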

18514772. AUTOMATED CATEGORIZATION AND PROCESSING OF DOCUMENT IMAGES OF VARYING DEGREES OF QUALITY simplified abstract (Bank of America Corporation)
