18463941. INCREMENTALLY IMPROVING CLUSTERING OF CROSS PARTITION DATA IN A DISTRIBUTED DATA SYSTEM simplified abstract (Microsoft Technology Licensing, LLC)
INCREMENTALLY IMPROVING CLUSTERING OF CROSS PARTITION DATA IN A DISTRIBUTED DATA SYSTEM
Organization Name
Microsoft Technology Licensing, LLC
Inventor(s)
Babatunde Micheal Okutubo of Bellevue WA (US)
Maninderjit Singh Parmar of Redmond WA (US)
Edgars Sedols of Bellevue WA (US)
INCREMENTALLY IMPROVING CLUSTERING OF CROSS PARTITION DATA IN A DISTRIBUTED DATA SYSTEM - A simplified explanation of the abstract
This abstract first appeared for US patent application 18463941 titled 'INCREMENTALLY IMPROVING CLUSTERING OF CROSS PARTITION DATA IN A DISTRIBUTED DATA SYSTEM'.
Simplified Explanation
Methods and systems are provided for improved access to rows of data in a distributed data system.
- The data rows are associated with partitions and distributed in one or more files.
- An impure file includes data rows associated with multiple partitions.
- A clustering set is generated from the impure files: a candidate impure file is selected based on file access activity metrics, along with one or more neighbor impure files.
- Data rows of the impure files in the clustering set are sorted according to their associated partitions.
- A set of disjoint partition range files is generated based on the sorted data rows.
- Each file in the set is transferred to a respective target partition.
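The steps listed above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the method claimed in the application: the abstract does not define the access metric, what makes a file a "neighbor", or how partition ranges are sized. Here `access_count` stands in for the file access activity metrics, a neighbor is assumed to be an impure file that shares at least one partition with the candidate, and each output file covers exactly one partition (the simplest case of disjoint ranges). All names (`Row`, `DataFile`, `build_clustering_set`, `recluster`) are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Row:
    partition: int  # partition this row is associated with
    payload: str


@dataclass
class DataFile:
    name: str
    rows: list
    access_count: int = 0  # assumed stand-in for "file access activity metrics"

    def partitions(self):
        return {r.partition for r in self.rows}

    def is_impure(self):
        # An impure file holds rows from more than one partition.
        return len(self.partitions()) > 1


def build_clustering_set(files, max_neighbors=2):
    """Pick the most-accessed impure file plus up to max_neighbors neighbors."""
    impure = [f for f in files if f.is_impure()]
    if not impure:
        return []
    candidate = max(impure, key=lambda f: f.access_count)
    # Assumption: a neighbor shares at least one partition with the candidate.
    neighbors = [
        f for f in impure
        if f is not candidate and f.partitions() & candidate.partitions()
    ]
    # Prefer neighbors with the largest partition overlap.
    neighbors.sort(
        key=lambda f: len(f.partitions() & candidate.partitions()),
        reverse=True,
    )
    return [candidate] + neighbors[:max_neighbors]


def recluster(clustering_set):
    """Sort all rows by partition and split them into one file per partition.

    Returns a dict mapping target partition -> rows of the new file.
    """
    rows = sorted(
        (r for f in clustering_set for r in f.rows),
        key=lambda r: r.partition,
    )
    return {p: list(g) for p, g in groupby(rows, key=lambda r: r.partition)}


files = [
    DataFile("f1", [Row(1, "a"), Row(3, "b")], access_count=10),
    DataFile("f2", [Row(3, "c"), Row(2, "d")], access_count=4),
    DataFile("f3", [Row(5, "e"), Row(6, "f")], access_count=1),
    DataFile("f4", [Row(2, "g")], access_count=50),  # pure, never selected
]

clustering_set = build_clustering_set(files)
new_files = recluster(clustering_set)
for target_partition, rows in new_files.items():
    print(target_partition, [r.payload for r in rows])
```

Because only one candidate and a bounded number of neighbors are rewritten per pass, repeating the pass improves clustering incrementally rather than reorganizing every file at once.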
Potential Applications
This technology could apply wherever a distributed data system stores rows from multiple partitions in the same file and needs faster access to those rows. Some potential applications include:
- Big data analytics platforms
- Distributed databases
- Cloud computing environments
- Data-intensive applications
Problems Solved
The technology addresses the following problems:
- Inefficient access to rows of data in a distributed data system
- Difficulty in managing and organizing data rows associated with multiple partitions
- Performance bottlenecks caused by impure files with data rows from multiple partitions
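One way to see why impure files can hurt access is to count how many files must be opened to read every row of a given partition. The sketch below assumes that simple cost model, which the abstract itself does not spell out. The function name `files_per_partition` and the sample layouts are hypothetical.

```python
from collections import defaultdict


def files_per_partition(files):
    """Count the files that contain at least one row of each partition.

    files: dict mapping file name -> list of partition ids, one per row.
    """
    touched = defaultdict(set)
    for name, row_partitions in files.items():
        for p in row_partitions:
            touched[p].add(name)
    return {p: len(names) for p, names in touched.items()}


# Impure layout: every file mixes rows from two partitions.
before = {"f1": [1, 3], "f2": [3, 2], "f3": [2, 1]}
# Layout after sorting rows by partition and rewriting one file per partition.
after = {"p1": [1, 1], "p2": [2, 2], "p3": [3, 3]}

print(files_per_partition(before))  # each partition is spread over 2 files
print(files_per_partition(after))   # each partition lives in 1 file
```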
Benefits
The use of this technology offers several benefits:
- Improved efficiency and speed in accessing data rows
- Enhanced organization and management of data rows associated with partitions
- Reduction of performance bottlenecks and improved overall system performance
Original Abstract Submitted
Methods and systems are provided for improved access to rows of data in a distributed data system. Each data row is associated with a partition. Data rows are distributed in one or more files and an impure file includes data rows associated multiple partitions. A clustering set is generated from a plurality of impure files by selecting a candidate impure file based on file access activity metrics and one or more neighbor impure files. Data rows of the impure files included in the clustering set are sorted according to their respective associated partitions. A set of disjoint partition range files are generated based on the sorted data rows of the impure files included in the clustering set. Each file of the set of disjoint partition range files is transferred to a respective target partition.