INCREMENTALLY IMPROVING CLUSTERING OF CROSS PARTITION DATA IN A DISTRIBUTED DATA SYSTEM

Organization Name

Microsoft Technology Licensing, LLC

Inventor(s)

Babatunde Micheal Okutubo of Bellevue WA (US)

Maninderjit Singh Parmar of Redmond WA (US)

INCREMENTALLY IMPROVING CLUSTERING OF CROSS PARTITION DATA IN A DISTRIBUTED DATA SYSTEM - A simplified explanation of the abstract

This abstract first appeared for US patent application 18463941 titled 'INCREMENTALLY IMPROVING CLUSTERING OF CROSS PARTITION DATA IN A DISTRIBUTED DATA SYSTEM'.

Simplified Explanation

Methods and systems are provided for improved access to rows of data in a distributed data system.

The data rows are associated with partitions and distributed in one or more files.
An impure file includes data rows associated with multiple partitions.
A clustering set is generated by selecting a candidate impure file, chosen based on file access activity metrics, together with one or more neighbor impure files.
Data rows of the impure files in the clustering set are sorted according to their associated partitions.
A set of disjoint partition range files is generated based on the sorted data rows.
Each file in the set is transferred to a respective target partition.
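
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method claimed in the application: the `DataFile` class, the `access_count` field (standing in for "file access activity metrics"), the definition of a neighbor as an impure file that shares at least one partition with the candidate, and the choice of one output file per partition are all hypothetical simplifications.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class DataFile:
    """Hypothetical in-memory stand-in for a file of data rows."""
    name: str
    rows: list             # list of (partition_id, payload) tuples
    access_count: int = 0  # assumed proxy for file access activity metrics

    def partitions(self):
        return {pid for pid, _ in self.rows}

    def is_impure(self):
        # An impure file holds rows from more than one partition.
        return len(self.partitions()) > 1


def build_clustering_set(files, max_neighbors=2):
    """Pick a candidate impure file plus a few neighbor impure files."""
    impure = [f for f in files if f.is_impure()]
    if not impure:
        return []
    # Assumption: the most frequently accessed impure file is the candidate.
    candidate = max(impure, key=lambda f: f.access_count)
    # Assumption: a neighbor shares at least one partition with the candidate.
    neighbors = [
        f for f in impure
        if f is not candidate and f.partitions() & candidate.partitions()
    ]
    neighbors.sort(
        key=lambda f: len(f.partitions() & candidate.partitions()),
        reverse=True,
    )
    return [candidate] + neighbors[:max_neighbors]


def recluster(clustering_set):
    """Sort all rows by partition and split them into disjoint outputs.

    Returns a dict mapping each target partition to the rows of its new
    file. One partition per output file is the simplest disjoint range.
    """
    rows = [row for f in clustering_set for row in f.rows]
    rows.sort(key=lambda row: row[0])  # stable sort by partition id
    return {
        pid: list(group)
        for pid, group in groupby(rows, key=lambda row: row[0])
    }
```

Calling `recluster(build_clustering_set(files))` on a list of `DataFile` objects yields one row list per partition; in a real system each of those would be written out as a file and transferred to its target partition. Because only one clustering set is processed per call, repeated calls improve clustering incrementally rather than rewriting every file at once.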

Potential Applications

This technology can be applied in various industries and scenarios where there is a need for improved access to rows of data in a distributed data system. Some potential applications include:

Big data analytics platforms
Distributed databases
Cloud computing environments
Data-intensive applications

Problems Solved

The technology addresses the following problems:

Inefficient access to rows of data in a distributed data system
Difficulty in managing and organizing data rows associated with multiple partitions
Performance bottlenecks caused by impure files with data rows from multiple partitions

Benefits

The use of this technology offers several benefits:

Improved efficiency and speed in accessing data rows
Enhanced organization and management of data rows associated with partitions
Reduction of performance bottlenecks and improved overall system performance

Original Abstract Submitted

Methods and systems are provided for improved access to rows of data in a distributed data system. Each data row is associated with a partition. Data rows are distributed in one or more files and an impure file includes data rows associated with multiple partitions. A clustering set is generated from a plurality of impure files by selecting a candidate impure file based on file access activity metrics and one or more neighbor impure files. Data rows of the impure files included in the clustering set are sorted according to their respective associated partitions. A set of disjoint partition range files is generated based on the sorted data rows of the impure files included in the clustering set. Each file of the set of disjoint partition range files is transferred to a respective target partition.

18463941. INCREMENTALLY IMPROVING CLUSTERING OF CROSS PARTITION DATA IN A DISTRIBUTED DATA SYSTEM simplified abstract (Microsoft Technology Licensing, LLC)
