EXPEDITING AUTOMATED NEAR-DUPLICATE DETECTION FOR NEW TEXT DOCUMENTS

Abstract: Techniques described herein provide for automated near-duplicate detection for new text documents given text documents that were previously processed using automated near-duplicate detection for text documents. In one example, a system can receive new documents and documents that were previously processed using a predefined processing technique for automated near-duplicate detection. The system can process the new documents and cluster the new documents into multiple predefined clusters previously identified using the predefined processing technique. For each predefined cluster including at least one new document, the system can generate document groups by determining similarity scores using the predefined processing technique as applied to the documents in the predefined clusters. The system can identify a representative document for each document group and generate an output data structure including the document groups and the representative document for each group.
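
The flow described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say what the predefined processing technique is, so everything concrete here is an assumption made for illustration: Jaccard similarity over three-word shingles stands in for the similarity score, a new document joins the predefined cluster that holds its most similar previously processed document, groups are formed by linking documents whose score meets a threshold, and the representative is the group member with the highest total similarity to the others. The function names (shingles, jaccard, group_cluster, detect_near_duplicates) and the threshold value are hypothetical and do not come from the application.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed signature scheme)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (assumed similarity score)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_cluster(members, sigs, threshold):
    """Link documents whose similarity meets the threshold (union-find)."""
    parent = {m: m for m in members}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(members, 2):
        if jaccard(sigs[a], sigs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for m in members:
        groups.setdefault(find(m), []).append(m)
    return list(groups.values())


def representative(group, sigs):
    """Pick the member with the highest total similarity to the rest (assumption)."""
    def total(d):
        return sum(jaccard(sigs[d], sigs[o]) for o in group if o != d)
    return max(sorted(group), key=total)  # ties resolve to the smallest id


def detect_near_duplicates(old_docs, clusters, new_docs, threshold=0.5):
    """Assign new documents to predefined clusters, then regroup only the
    clusters that received at least one new document.

    old_docs / new_docs map document id -> text; clusters maps cluster id ->
    non-empty list of previously processed document ids.
    """
    sigs = {d: shingles(t) for d, t in {**old_docs, **new_docs}.items()}
    updated = {c: list(m) for c, m in clusters.items()}
    touched = []
    for doc_id in new_docs:
        # Nearest predefined cluster, judged by its most similar old member.
        best = max(
            clusters,
            key=lambda c: max(jaccard(sigs[doc_id], sigs[m]) for m in clusters[c]),
        )
        updated[best].append(doc_id)
        if best not in touched:
            touched.append(best)

    # Output data structure: cluster id -> list of groups with representatives.
    return {
        c: [
            {"representative": representative(g, sigs), "members": g}
            for g in group_cluster(updated[c], sigs, threshold)
        ]
        for c in touched
    }
```

Because only clusters that received a new document are regrouped, previously processed clusters with no new members are skipped entirely, which is where a time saving of the kind suggested by the title would come from in this sketch.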

Inventor(s): Fan WANG, Teresa S. JADE, Xu YANG

CPC Classification: G06F16/906 (Details of database functions independent of the retrieved data types)

Patent Application Number: 20250231992