TEXT STRING COMPARISON FOR DUPLICATE OR NEAR-DUPLICATE TEXT DOCUMENTS IDENTIFIED USING AUTOMATED NEAR-DUPLICATE DETECTION FOR TEXT DOCUMENTS

Abstract: Techniques described herein provide for text string comparison for documents identified using automated near-duplicate detection. In one example, a system can receive a pair of documents. The system can extract text strings from the documents. The system can normalize the extracted text strings using a predefined normalization scheme. The system can identify boilerplate text segments in the normalized text strings. The system can remove the boilerplate text segments from the normalized text strings to generate filtered text strings. The system can divide the filtered text strings into sections by identifying section indicators. The system can, for each section, generate groupings of text strings and determine a similarity score between each pair of corresponding groupings to identify matching groupings of text strings. The system can generate an output for display showing visual indications of the matching groupings of text strings.
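
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the abstract does not specify the normalization scheme, the boilerplate list, the section indicators, the grouping rule, or the similarity measure. Everything named here (`BOILERPLATE`, `SECTION_RE`, `group_lines`, `compare_documents`, the grouping size of 2, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the score) is an assumption chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical boilerplate segments; the abstract does not say how they are identified.
BOILERPLATE = {"confidential - do not distribute"}

# Hypothetical section indicator: lines such as "Section 2 ..." or "3. ...".
SECTION_RE = re.compile(r"^(?:section\s+\d+|\d+\.)\s", re.IGNORECASE)


def normalize(text):
    """Assumed normalization scheme: lowercase, collapse whitespace, drop empty lines."""
    lines = (re.sub(r"\s+", " ", ln).strip().lower() for ln in text.splitlines())
    return [ln for ln in lines if ln]


def remove_boilerplate(lines, boilerplate=BOILERPLATE):
    """Drop lines that exactly match a known boilerplate segment."""
    return [ln for ln in lines if ln not in boilerplate]


def split_sections(lines):
    """Start a new section at each line that looks like a section indicator."""
    sections = [[]]
    for ln in lines:
        if SECTION_RE.match(ln) and sections[-1]:
            sections.append([])
        sections[-1].append(ln)
    return [s for s in sections if s]


def group_lines(section, size=2):
    """Assumed grouping rule: join every `size` consecutive lines into one grouping."""
    return [" ".join(section[i:i + size]) for i in range(0, len(section), size)]


def compare_documents(doc_a, doc_b, threshold=0.8):
    """Return (section index, grouping index, score, matched) for corresponding groupings."""
    secs_a = split_sections(remove_boilerplate(normalize(doc_a)))
    secs_b = split_sections(remove_boilerplate(normalize(doc_b)))
    results = []
    for s_idx, (sec_a, sec_b) in enumerate(zip(secs_a, secs_b)):
        pairs = zip(group_lines(sec_a), group_lines(sec_b))
        for g_idx, (grp_a, grp_b) in enumerate(pairs):
            # SequenceMatcher.ratio() stands in for the unspecified similarity score.
            score = SequenceMatcher(None, grp_a, grp_b).ratio()
            results.append((s_idx, g_idx, score, score >= threshold))
    return results
```

Calling `compare_documents(text_a, text_b)` on two extracted text strings yields one tuple per pair of corresponding groupings; a display layer could then highlight the groupings whose `matched` flag is true. Pairing sections and groupings by position with `zip` is the simplest possible alignment and silently ignores extra sections in the longer document.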

Inventor(s): Fan WANG, Teresa S. JADE, Xu YANG

CPC Classification: G06F16/906 (Details of database functions independent of the retrieved data types)

Patent application number: 20250231993