skmultiflow.transform.MissingValuesCleaner

class skmultiflow.transform.MissingValuesCleaner(missing_value=nan, strategy='zero', window_size=200, new_value=1)

Fill missing values with a defined value.

Provides a simple way to replace missing values in data samples. The replacement value is determined by the chosen imputation strategy.

Parameters
missing_value: int, float or list (Default: numpy.nan)

Missing value to replace

strategy: string (Default: ‘zero’)

The strategy adopted to find the missing value replacement. It can be one of the following: ‘zero’, ‘mean’, ‘median’, ‘mode’, ‘custom’.

window_size: int (Default: 200)

Defines the window size for the ‘mean’, ‘median’ and ‘mode’ strategies.

new_value: int (Default: 1)

The replacement value used when the chosen strategy is ‘custom’.
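As a rough illustration of how the five strategy names could map to a replacement value, the following sketch uses only the Python standard library. The function `replacement_value` and its `window_column` argument are hypothetical names introduced here, and the code is not the scikit-multiflow implementation.

```python
import statistics


def replacement_value(strategy, window_column, new_value=1):
    """Return the value used to fill a missing entry of one feature.

    `window_column` holds recently observed values of that feature
    (hypothetical helper, not part of scikit-multiflow).
    """
    if strategy == 'zero':
        return 0
    if strategy == 'custom':
        return new_value
    if strategy == 'mean':
        return statistics.mean(window_column)
    if strategy == 'median':
        return statistics.median(window_column)
    if strategy == 'mode':
        return statistics.mode(window_column)
    raise ValueError(f"Unknown strategy: {strategy!r}")


# Recent values of a single feature, standing in for the sample window.
recent = [2.0, 4.0, 4.0, 10.0]
for name in ('zero', 'mean', 'median', 'mode', 'custom'):
    print(name, replacement_value(name, recent, new_value=-1))
```

Only the ‘mean’, ‘median’ and ‘mode’ branches read the window, which is consistent with window_size being documented as relevant to those three strategies.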

Notes

A missing value in a sample can be encoded in many different ways. The most common is NumPy’s NaN, which is why it is the default for the missing_value parameter.

Choose the substitution strategy that fits the use case, since each strategy has its pros and cons. The predefined strategies are ‘zero’, ‘mean’, ‘median’, ‘mode’ and ‘custom’.

Note that MissingValuesCleaner can also be used to replace arbitrary values, not only NaN, since missing_value accepts any int, float or list.
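Matching arbitrary markers and matching NaN are slightly different problems, because NaN never compares equal to itself. The sketch below shows one way to test a value against a single marker or a list of markers. The helper `is_missing` is a hypothetical name used for illustration and does not reflect how the library performs the check.

```python
import math


def is_missing(value, missing_value):
    """Return True if `value` matches one of the configured markers."""
    markers = missing_value if isinstance(missing_value, list) else [missing_value]
    for marker in markers:
        # NaN != NaN, so a NaN marker needs an explicit isnan() test.
        if isinstance(marker, float) and math.isnan(marker):
            if isinstance(value, float) and math.isnan(value):
                return True
        elif value == marker:
            return True
    return False


# Treat -47 and -999 as "missing" and replace them with 0.
row = [3.5, -47, float('nan'), -999]
cleaned = [0 if is_missing(v, [-47, -999]) else v for v in row]
print(cleaned)
```

In this sketch the NaN entry is left untouched because NaN is not among the configured markers.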

Examples

>>> # Imports
>>> import numpy as np
>>> from skmultiflow.data.file_stream import FileStream
>>> from skmultiflow.transform.missing_values_cleaner import MissingValuesCleaner
>>> # Setting up a stream
>>> stream = FileStream("https://raw.githubusercontent.com/scikit-multiflow/"
...                     "streaming-datasets/master/covtype.csv")
>>> # Setting up the filter to replace the value -47 with the median of the
>>> # last 10 samples
>>> cleaner = MissingValuesCleaner(-47, 'median', 10)
>>> X, y = stream.next_sample(10)
>>> X[9, 0] = -47
>>> # We will use this list to keep track of values
>>> data = []
>>> # Iterate over the first 9 samples, to build a sample window
>>> for i in range(9):
...     X_transf = cleaner.partial_fit_transform([X[i].tolist()])
...     data.append(X_transf[0][0])
...
>>> # Transform the last sample. The first feature should be replaced by the
>>> # median of the values collected in the list
>>> X_transf = cleaner.partial_fit_transform([X[9].tolist()])
>>> np.median(data)

Methods

get_info(self)

Collects and returns information about the configuration of the estimator.

get_params(self[, deep])

Get parameters for this estimator.

partial_fit(self, X[, y])

Partially fits the model.

partial_fit_transform(self, X[, y])

Partially fits the model and then applies the transform to the data.

reset(self)

Resets the estimator to its initial state.

set_params(self, **params)

Set the parameters of this estimator.

transform(self, X)

Transforms the samples in X.

get_info(self)

Collects and returns information about the configuration of the estimator.

Returns
string

Configuration of the estimator.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters
deep: boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: mapping of string to any

Parameter names mapped to their values.

partial_fit(self, X, y=None)

Partially fits the model.

Parameters
X: numpy.ndarray of shape (n_samples, n_features)

The sample or set of samples that should be transformed.

y: Array-like

The true labels.

Returns
MissingValuesCleaner

self

partial_fit_transform(self, X, y=None)

Partially fits the model and then applies the transform to the data.

Parameters
X: numpy.ndarray of shape (n_samples, n_features)

The sample or set of samples that should be transformed.

y: Array-like

The true labels.

Returns
numpy.ndarray of shape (n_samples, n_features)

The transformed data.
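The fit-then-transform order described above can be sketched with a small stand-in class. `WindowedMedianImputer` is a hypothetical name and not the scikit-multiflow class. It makes two assumptions that the documentation does not confirm: stored marker values are skipped when the median is computed, and 0 is used when the window holds no observed value.

```python
from collections import deque
import statistics


class WindowedMedianImputer:
    """Hypothetical stand-in used only to illustrate the call order."""

    def __init__(self, missing_value, window_size):
        self.missing_value = missing_value
        # Bounded buffer: the oldest sample is dropped once it is full.
        self.window = deque(maxlen=window_size)

    def partial_fit(self, X):
        # Store every incoming sample in the window.
        for row in X:
            self.window.append(list(row))
        return self

    def transform(self, X):
        out = []
        for row in X:
            new_row = []
            for j, value in enumerate(row):
                if value == self.missing_value:
                    # Assumption: markers stored in the window are ignored.
                    observed = [r[j] for r in self.window
                                if r[j] != self.missing_value]
                    # Assumption: fall back to 0 when nothing was observed.
                    value = statistics.median(observed) if observed else 0
                new_row.append(value)
            out.append(new_row)
        return out

    def partial_fit_transform(self, X):
        # Update the window first, then replace the missing entries.
        return self.partial_fit(X).transform(X)


imputer = WindowedMedianImputer(missing_value=-47, window_size=10)
for value in (1, 5, 3):
    imputer.partial_fit_transform([[value, 0]])

result = imputer.partial_fit_transform([[-47, 0]])
print(result)  # [[3, 0]]
```

Processing one sample per call, as in the loop above, mirrors the usage pattern of the example in this page, where each sample is wrapped in a list before being passed in.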

reset(self)

Resets the estimator to its initial state.

Returns
self

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
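The `<component>__<parameter>` naming can be illustrated with a minimal sketch. `Component` is a hypothetical class written for this example. It only shows how a double-underscore name can be split and forwarded, and it omits the validation a real estimator would perform.

```python
class Component:
    """Hypothetical object exposing a minimal set_params."""

    def __init__(self, **params):
        self.__dict__.update(params)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "cleaner__strategy" into "cleaner" and "strategy".
            name, delim, sub_name = key.partition('__')
            if delim:
                # Forward the remainder to the named sub-object.
                getattr(self, name).set_params(**{sub_name: value})
            else:
                setattr(self, name, value)
        return self


cleaner = Component(strategy='zero', window_size=200)
pipeline = Component(cleaner=cleaner, verbose=False)

# Update a nested parameter and a top-level parameter in one call.
pipeline.set_params(cleaner__strategy='median', verbose=True)
print(cleaner.strategy, pipeline.verbose)
```

Because `set_params` returns self, calls can be chained, which matches the return value documented for this method.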

Returns
self

transform(self, X)

Transforms the samples in X.

Parameters
X: numpy.ndarray of shape (n_samples, n_features)

The sample or set of samples that should be transformed.