AUTOMATED NEAR-DUPLICATE DETECTION FOR TEXT DOCUMENTS

Abstract: Techniques described herein provide for automated detection of near-duplicate documents. In one example, a system can cluster documents into a set of clusters based on character frequencies associated with the documents. For a given cluster, the system can generate first similarity scores for every pair of documents in the cluster. The system can then select a filtered group of documents whose first similarity scores meet or exceed a first predefined similarity threshold. Next, the system can convert the filtered group of documents into matrix representations and generate second similarity scores for every pair of matrix representations. The system can then identify documents, from among the filtered group, whose second similarity scores meet or exceed a second predefined similarity threshold. The identified documents can be duplicate or near-duplicate text documents.
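
The stages named in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the abstract does not specify the clustering algorithm, the similarity measures, the matrix representation, or the threshold values. Everything concrete below is an assumption chosen for illustration: greedy clustering by cosine similarity of character-frequency vectors, Jaccard word overlap as the first score, a character-bigram count matrix as the matrix representation, cosine similarity of flattened matrices as the second score, and all function names and default thresholds.

```python
from collections import Counter
from itertools import combinations
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # assumed character set

def normalize(text):
    """Lowercase the text and map every character outside ALPHABET to a space."""
    return "".join(ch if ch in ALPHABET else " " for ch in text.lower())

def char_freq(text):
    """Character-frequency vector over ALPHABET."""
    counts = Counter(normalize(text))
    return [counts[ch] for ch in ALPHABET]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_by_char_freq(docs, threshold=0.95):
    """Greedy clustering (an assumption): a document joins the first cluster
    whose first member has a sufficiently similar character-frequency vector."""
    clusters = []
    for doc_id, text in docs.items():
        vec = char_freq(text)
        for cluster in clusters:
            if cosine(vec, char_freq(docs[cluster[0]])) >= threshold:
                cluster.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters

def jaccard(a, b):
    """First similarity score (an assumption): word-set overlap."""
    sa, sb = set(normalize(a).split()), set(normalize(b).split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def bigram_matrix(text):
    """Matrix representation (an assumption): character-bigram counts."""
    idx = {ch: i for i, ch in enumerate(ALPHABET)}
    m = [[0] * len(ALPHABET) for _ in ALPHABET]
    s = normalize(text)
    for a, b in zip(s, s[1:]):
        m[idx[a]][idx[b]] += 1
    return m

def matrix_similarity(m1, m2):
    """Second similarity score (an assumption): cosine of flattened matrices."""
    return cosine([x for row in m1 for x in row],
                  [x for row in m2 for x in row])

def find_near_duplicates(docs, t1=0.5, t2=0.9):
    """Return the set of (id, id) pairs that pass both thresholds."""
    results = set()
    for cluster in cluster_by_char_freq(docs):
        # First pass: keep documents in any pair that meets threshold t1.
        filtered = set()
        for a, b in combinations(cluster, 2):
            if jaccard(docs[a], docs[b]) >= t1:
                filtered.update((a, b))
        # Second pass: compare matrix representations of the filtered group.
        matrices = {d: bigram_matrix(docs[d]) for d in filtered}
        for a, b in combinations(sorted(filtered), 2):
            if matrix_similarity(matrices[a], matrices[b]) >= t2:
                results.add((a, b))
    return results
```

Calling `find_near_duplicates` on a dictionary that maps document IDs to text returns the ID pairs that survive both passes. The point of the two-stage design is cost: the cheap first score prunes each cluster before the more expensive matrix comparison runs on the remaining documents.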

Inventor(s): Fan Wang, Teresa S. Jade, Xu Yang

CPC Classification: G06F16/906 (Details of database functions independent of the retrieved data types)

Patent Application Number: 20250165536