US Patent Application 17804218. DATA SELECTION FOR MACHINE LEARNING MODELS BASED ON DATA PROFILING simplified abstract


DATA SELECTION FOR MACHINE LEARNING MODELS BASED ON DATA PROFILING

Organization Name

International Business Machines Corporation

Inventor(s)

Paulina Toro Isaza of White Plains NY (US)

Yu Deng of Yorktown Heights NY (US)

Michael Elton Nidd of Zurich (CH)

Harshit Kumar of Delhi (IN)

Larisa Shwartz of Greenwich CT (US)

DATA SELECTION FOR MACHINE LEARNING MODELS BASED ON DATA PROFILING - A simplified explanation of the abstract

This abstract first appeared for US patent application 17804218 titled 'DATA SELECTION FOR MACHINE LEARNING MODELS BASED ON DATA PROFILING'.

Simplified Explanation

The patent application describes a method, computer system, and computer program for data selection.

  • The invention involves generating a first model for a dataset and determining its performance level based on various dataset metric values.
  • If the first model's performance level fails to exceed a performance threshold, the invention creates multiple data subsets from the dataset.
  • The invention calculates subset metric values for the data subsets and generates a second model for at least one data subset based on these values.
  • If the second model's performance level exceeds the performance threshold, the invention determines an optimization for the first model.
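
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method claimed in the application: the abstract does not say which model type, metrics, subset criteria, or optimization are used. The sketch assumes a least-squares line as the model, R² as the performance level, the absolute x/y correlation as the subset metric that drives selection, a per-row source label as the way subsets are formed, and "retrain on the selected subset" as the optimization. All function and field names (fit_line, r_squared, profile, select_data) are hypothetical.

```python
from statistics import mean


def fit_line(points):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # no spread in x: fall back to a constant model
    a = sum((x - mx) * (y - my) for x, y in points) / sxx
    return a, my - a * mx


def r_squared(model, points):
    """Assumed performance level: R^2 of the model on the given points."""
    a, b = model
    my = mean(y for _, y in points)
    ss_tot = sum((y - my) ** 2 for _, y in points)
    if ss_tot == 0:
        return 0.0
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in points)
    return 1.0 - ss_res / ss_tot


def profile(points):
    """Assumed profiling metrics for a (sub)dataset: size and |corr(x, y)|."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    corr = sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0
    return {"size": len(points), "abs_correlation": abs(corr)}


def select_data(rows, threshold):
    """rows are (source_label, x, y) tuples; returns a summary dict."""
    all_points = [(x, y) for _, x, y in rows]
    first_score = r_squared(fit_line(all_points), all_points)
    result = {"first_score": first_score, "selected_subset": None,
              "second_score": None, "optimization": None}
    if first_score > threshold:
        return result  # first model is good enough; nothing to do

    # First model failed to exceed the threshold: build subsets (assumed: by source).
    subsets = {}
    for label, x, y in rows:
        subsets.setdefault(label, []).append((x, y))

    # Profile each subset and pick one by its metric values.
    metrics = {label: profile(pts) for label, pts in subsets.items()}
    best = max(metrics, key=lambda label: metrics[label]["abs_correlation"])

    # Second model on the selected subset.
    second_score = r_squared(fit_line(subsets[best]), subsets[best])
    result.update(selected_subset=best, second_score=second_score)
    if second_score > threshold:
        # Assumed form of the "optimization" for the first model.
        result["optimization"] = f"retrain the first model on subset '{best}'"
    return result


# Made-up example data: one clean linear source and one noisy source.
noisy_y = [50, -40, 45, -50, 40, -45, 50, -40, 45, -50]
rows = [("clean", x, 2 * x + 1) for x in range(10)]
rows += [("noisy", x, y) for x, y in enumerate(noisy_y)]

result = select_data(rows, threshold=0.9)
print(result)
```

In this made-up example the model fitted on the combined data scores far below the 0.9 threshold, the "clean" subset is selected on its profile, and the second model fitted on it clears the threshold, so the sketch recommends retraining on that subset. A real system would likely profile many more metrics and choose among candidate subsets less naively.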


Original Abstract Submitted

A method, computer system, and a computer program for data selection is provided. The present invention may include generating a first model associated with a dataset. The present invention may further include determining a first model performance level associated with the first model based on a plurality of dataset metric values of the dataset. The present invention may further include generating a plurality of data subsets of a dataset based on the first model performance level failing to exceed a performance threshold and calculating a plurality of subset metric values associated with the plurality of data subsets. The present invention may further include generating a second model associated with at least one data subset based on the plurality of subset metric values and determining an optimization associated with the first model based on a second model performance level associated with the second model exceeding the performance threshold.