18463941. INCREMENTALLY IMPROVING CLUSTERING OF CROSS PARTITION DATA IN A DISTRIBUTED DATA SYSTEM simplified abstract (Microsoft Technology Licensing, LLC)

From WikiPatents

INCREMENTALLY IMPROVING CLUSTERING OF CROSS PARTITION DATA IN A DISTRIBUTED DATA SYSTEM

Organization Name

Microsoft Technology Licensing, LLC

Inventor(s)

Babatunde Micheal Okutubo of Bellevue WA (US)

Maninderjit Singh Parmar of Redmond WA (US)

Edgars Sedols of Bellevue WA (US)

INCREMENTALLY IMPROVING CLUSTERING OF CROSS PARTITION DATA IN A DISTRIBUTED DATA SYSTEM - A simplified explanation of the abstract

This abstract first appeared for US patent application 18463941, titled 'INCREMENTALLY IMPROVING CLUSTERING OF CROSS PARTITION DATA IN A DISTRIBUTED DATA SYSTEM'.

Simplified Explanation

Methods and systems are provided for improved access to rows of data in a distributed data system.

  • The data rows are associated with partitions and distributed in one or more files.
  • An impure file includes data rows associated with multiple partitions.
  • A clustering set is generated by selecting a candidate impure file based on file access activity metrics and neighbor impure files.
  • Data rows of the impure files in the clustering set are sorted according to their associated partitions.
  • A set of disjoint partition range files is generated based on the sorted data rows.
  • Each file in the set is transferred to a respective target partition.
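
The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the method claimed in the application: the names DataFile, access_count, build_clustering_set and split_by_partition are hypothetical, a single access counter stands in for the "file access activity metrics", a "neighbor" is assumed to be an impure file that shares at least one partition with the candidate, and each output file is assumed to cover exactly one partition (the simplest possible disjoint range).

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List, Set, Tuple

# Hypothetical row shape: (partition_id, payload).
Row = Tuple[int, str]


@dataclass
class DataFile:
    name: str
    rows: List[Row]
    access_count: int = 0  # assumed stand-in for file access activity metrics

    def partitions(self) -> Set[int]:
        """Partition ids that appear in this file."""
        return {pid for pid, _ in self.rows}

    def is_impure(self) -> bool:
        """An impure file holds rows from more than one partition."""
        return len(self.partitions()) > 1


def build_clustering_set(files: List[DataFile], max_neighbors: int = 2) -> List[DataFile]:
    """Pick the most-accessed impure file plus a few neighbor impure files."""
    impure = [f for f in files if f.is_impure()]
    if not impure:
        return []
    candidate = max(impure, key=lambda f: f.access_count)
    wanted = candidate.partitions()
    # Assumption: neighbors are impure files overlapping the candidate's partitions.
    neighbors = [f for f in impure if f is not candidate and f.partitions() & wanted]
    # Prefer the neighbors with the largest partition overlap.
    neighbors.sort(key=lambda f: len(f.partitions() & wanted), reverse=True)
    return [candidate] + neighbors[:max_neighbors]


def split_by_partition(cluster: List[DataFile]) -> Dict[int, List[Row]]:
    """Sort all rows in the clustering set by partition and cut one output per partition."""
    rows = sorted((r for f in cluster for r in f.rows), key=lambda r: r[0])
    # groupby requires sorted input; each group becomes one disjoint output file.
    return {pid: list(group) for pid, group in groupby(rows, key=lambda r: r[0])}
```

Given an impure file holding rows for partitions 1 and 2, a second impure file holding rows for partitions 2 and 3, and a pure file for partition 7, build_clustering_set returns the two impure files and split_by_partition yields three outputs keyed 1, 2 and 3. Transferring each output to its target partition is left out, since the abstract does not describe how the transfer is carried out.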

Potential Applications

This technology applies wherever rows of data in a distributed data system need to be accessed efficiently by partition. Potential applications include:

  • Big data analytics platforms
  • Distributed databases
  • Cloud computing environments
  • Data-intensive applications

Problems Solved

The technology addresses the following problems:

  • Inefficient access to rows of data in a distributed data system
  • Difficulty in managing and organizing data rows associated with multiple partitions
  • Performance bottlenecks caused by impure files with data rows from multiple partitions

Benefits

The use of this technology offers several benefits:

  • Improved efficiency and speed in accessing data rows
  • Enhanced organization and management of data rows associated with partitions
  • Reduction of performance bottlenecks and improved overall system performance


Original Abstract Submitted

Methods and systems are provided for improved access to rows of data in a distributed data system. Each data row is associated with a partition. Data rows are distributed in one or more files and an impure file includes data rows associated with multiple partitions. A clustering set is generated from a plurality of impure files by selecting a candidate impure file based on file access activity metrics and one or more neighbor impure files. Data rows of the impure files included in the clustering set are sorted according to their respective associated partitions. A set of disjoint partition range files is generated based on the sorted data rows of the impure files included in the clustering set. Each file of the set of disjoint partition range files is transferred to a respective target partition.