IMPUTING MACHINE LEARNING TRAINING DATA

Organization Name

INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventor(s)

A PENG Zhang of Xi'an (CN)

Xiao Ming Ma of Xi'an (CN)

Lei Gao of Xi'an (CN)

Jin Wang of Xi'an (US)

Kai Li of Xi'an (CN)

IMPUTING MACHINE LEARNING TRAINING DATA - A simplified explanation of the abstract

This abstract first appeared for US patent application 17457665 titled 'IMPUTING MACHINE LEARNING TRAINING DATA'.

Simplified Explanation

The patent application describes a method for imputing missing values in a dataset using a cluster model and linear regression.

The method determines a correlation list for the predictors that have missing values.
It then generates a cluster model with multiple clusters, based on a target value and predictor values.
For a row of the original training data that has a missing value, the method determines an imputed value using a linear regression model.
The linear regression model is built from multiple non-missing predictor values for the clusters.
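
The steps above can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not say how the correlation list is computed, which clustering algorithm is used, or which predictors feed the regression. The sketch assumes absolute Pearson correlation for the list, a plain k-means over the target and the fully observed predictors for the cluster model, and an ordinary least-squares fit per cluster. All function names and the parameters k and top are hypothetical.

```python
import numpy as np

def correlation_list(X, col, complete):
    """Rank the fully observed predictors by |Pearson r| with column `col`."""
    seen = ~np.isnan(X[:, col])
    score = {j: abs(np.corrcoef(X[seen, j], X[seen, col])[0, 1]) for j in complete}
    return sorted(complete, key=lambda j: -score[j])

def kmeans_labels(points, k, iters=20):
    """Lloyd's k-means with farthest-point initialisation; one label per row."""
    centers = [points[0]]
    for _ in range(k - 1):
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[nearest.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def impute_by_cluster_regression(X, y, k=2, top=2):
    """Fill NaNs in X with per-cluster least-squares predictions."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete = [j for j in range(X.shape[1]) if not np.isnan(X[:, j]).any()]
    # Assumption: cluster on the target plus the fully observed predictors.
    labels = kmeans_labels(np.column_stack([y, X[:, complete]]), k)
    for col in range(X.shape[1]):
        missing = np.isnan(X[:, col])
        if not missing.any():
            continue
        # Assumption: regress on the `top` most correlated complete predictors.
        regressors = correlation_list(X, col, complete)[:top]
        for c in np.unique(labels[missing]):
            fill = (labels == c) & missing
            train = (labels == c) & ~missing
            if not train.any():      # no observed rows in this cluster:
                train = ~missing     # fall back to all observed rows
            A = np.column_stack([np.ones(train.sum()), X[train][:, regressors]])
            coef, *_ = np.linalg.lstsq(A, X[train, col], rcond=None)
            B = np.column_stack([np.ones(fill.sum()), X[fill][:, regressors]])
            out[fill, col] = B @ coef
    return out
```

Calling impute_by_cluster_regression(X, y) on a NumPy array X with NaN entries and a target vector y returns a copy of X with the NaNs filled in. The sketch does not scale the features before clustering, so a predictor or target with a much larger range will dominate the cluster assignment; a real pipeline would likely standardise first.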

Potential applications of this technology:

Data analysis and prediction models that rely on complete datasets can benefit from this method.
It can be used in various fields such as finance, healthcare, and marketing where missing data is common.

Problems solved by this technology:

Missing data often leads to biased or inaccurate results in data analysis and prediction models.
This method helps to address the issue of missing values by providing imputed values based on the available data.

Benefits of this technology:

By imputing missing values, the method can make analysis and prediction models more accurate and reliable.
It reduces the need for manual data imputation, saving time and effort.
The cluster model and linear regression approach provide a systematic and efficient way to handle missing values in datasets.

Original Abstract Submitted

Embodiments are disclosed for a method. The method includes determining a correlation list of missing value predictors. The method also includes generating a cluster model having multiple clusters. The cluster model is based on a target value and predictor values. The method further includes determining an imputed value for a missing value of a row of original training data based on a linear regression model for multiple non-missing value predictor values for the clusters.