skmultiflow.transform.MissingValuesCleaner

class skmultiflow.transform.MissingValuesCleaner(missing_value=nan, strategy='zero', window_size=200, new_value=1)

Fill missing values with some defined value.

Replaces missing values in data samples with an imputed value. The imputed value is determined by the chosen imputation strategy.

Parameters

missing_value: int, float or list (Default: numpy.nan): Missing value to replace
strategy: string (Default: ‘zero’): The strategy adopted to find the missing value replacement. It can be one of the following: ‘zero’, ‘mean’, ‘median’, ‘mode’, ‘custom’.
window_size: int (Default: 200): Defines the window size for the ‘mean’, ‘median’ and ‘mode’ strategies.
new_value: int (Default: 1): The replacement value used when the chosen strategy is ‘custom’.
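The strategies listed above can be illustrated with a small standalone sketch. It is not the library's implementation: the function name `replacement_for` and its arguments are hypothetical, and it only assumes that the windowed strategies summarize the most recent values of a single feature.

```python
from statistics import mean, median, mode


def replacement_for(strategy, recent_values, new_value=1):
    """Return the value a given strategy would substitute (illustrative only).

    `recent_values` stands in for the window of recent observations
    of one feature.
    """
    if strategy == 'zero':
        return 0
    if strategy == 'custom':
        return new_value
    summaries = {'mean': mean, 'median': median, 'mode': mode}
    if strategy not in summaries:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return summaries[strategy](recent_values)


window = [2, 4, 4, 10]
for name in ('zero', 'mean', 'median', 'mode', 'custom'):
    print(name, replacement_for(name, window, new_value=-1))
```

Only ‘mean’, ‘median’ and ‘mode’ look at the window, which is why window_size matters for those three strategies alone.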

Notes

A missing value in a sample can be encoded in many different ways, but the most common is NumPy’s NaN, which is why it is the default for the missing_value parameter.
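One practical consequence of using NaN as the marker is that NaN never compares equal to itself, so a plain equality test cannot find it. A minimal sketch of a check that handles NaN as well as ordinary sentinel values follows. The helper `is_missing` is hypothetical and is not part of scikit-multiflow.

```python
import math


def is_missing(value, missing_value=math.nan):
    """Check whether `value` matches a missing-value marker (hypothetical helper).

    `missing_value` may be a single number or a list of numbers.
    """
    markers = missing_value if isinstance(missing_value, list) else [missing_value]
    for marker in markers:
        # NaN != NaN, so a NaN marker needs an explicit isnan test.
        if isinstance(marker, float) and math.isnan(marker):
            if isinstance(value, float) and math.isnan(value):
                return True
        elif value == marker:
            return True
    return False


print(math.nan == math.nan)           # False
print(is_missing(float('nan')))       # True
print(is_missing(-47, [-47, -999]))   # True
print(is_missing(3.5, [-47, -999]))   # False
```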

The user should choose the substitution strategy that suits their use case, as each strategy has its pros and cons. The predefined strategies are: ‘zero’, ‘mean’, ‘median’, ‘mode’, ‘custom’.

Note that MissingValuesCleaner can also be used to replace arbitrary values, not only missing ones.

Examples

>>> # Imports
>>> import numpy as np
>>> from skmultiflow.data.file_stream import FileStream
>>> from skmultiflow.transform.missing_values_cleaner import MissingValuesCleaner
>>> # Setting up a stream
>>> stream = FileStream("https://raw.githubusercontent.com/scikit-multiflow/"
...                     "streaming-datasets/master/covtype.csv")
>>> # Setting up the filter to replace the value -47 with the median of the
>>> # last 10 samples
>>> cleaner = MissingValuesCleaner(-47, 'median', 10)
>>> X, y = stream.next_sample(10)
>>> X[9, 0] = -47
>>> # We will use this list to keep track of values
>>> data = []
>>> # Iterate over the first 9 samples, to build a sample window
>>> for i in range(9):
...     X_transf = cleaner.partial_fit_transform([X[i].tolist()])
...     data.append(X_transf[0][0])
...
>>> # Transform last sample. The first feature should be replaced by the list's 
>>> # median value
>>> X_transf = cleaner.partial_fit_transform([X[9].tolist()])
>>> np.median(data)
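
To make the expected result concrete without downloading the dataset, the same flow can be imitated in plain Python. This is a standalone sketch, not scikit-multiflow code: the class `WindowMedianImputer` and its method `fit_transform_one` are hypothetical, and it assumes that every incoming sample joins the window and that marker values are ignored when the median is computed.

```python
from collections import deque
from statistics import median


class WindowMedianImputer:
    """Replace a marker value with the median of recent values, per feature."""

    def __init__(self, missing_value, window_size):
        self.missing_value = missing_value
        # A bounded deque drops the oldest sample once it is full.
        self.window = deque(maxlen=window_size)

    def fit_transform_one(self, sample):
        cleaned = []
        for j, value in enumerate(sample):
            if value == self.missing_value:
                # Valid values seen for feature j within the window.
                history = [row[j] for row in self.window
                           if row[j] != self.missing_value]
                # With no history yet, the marker is left unchanged.
                value = median(history) if history else value
            cleaned.append(value)
        self.window.append(list(sample))
        return cleaned


imputer = WindowMedianImputer(missing_value=-47, window_size=10)
rows = [[float(i), 1.0] for i in range(1, 10)]  # nine complete samples
rows.append([-47, 1.0])                         # tenth sample carries the marker

for row in rows[:9]:
    imputer.fit_transform_one(row)
print(imputer.fit_transform_one(rows[9]))       # [5.0, 1.0]
```

The first feature of the tenth sample becomes 5.0, the median of the nine values 1.0 through 9.0 that preceded it.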

Methods

`get_info`(self)	Collects and returns information about the configuration of the estimator.
`get_params`(self[, deep])	Get parameters for this estimator.
`partial_fit`(self, X[, y])	Partially fits the model.
`partial_fit_transform`(self, X[, y])	Partially fits the model and then applies the transform to the data.
`reset`(self)	Resets the estimator to its initial state.
`set_params`(self, **params)	Set the parameters of this estimator.
`transform`(self, X)	Transforms the samples in X.

get_info(self)

Collects and returns information about the configuration of the estimator.

Returns

string: Configuration of the estimator.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters

deep: boolean, optional: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: mapping of string to any: Parameter names mapped to their values.

partial_fit(self, X, y=None)

Partially fits the model.

Parameters

X: numpy.ndarray of shape (n_samples, n_features): The sample or set of samples that should be transformed.
y: Array-like: The true labels.

Returns

MissingValuesCleaner: self

partial_fit_transform(self, X, y=None)

Partially fits the model and then applies the transform to the data.

Parameters

X: numpy.ndarray of shape (n_samples, n_features): The sample or set of samples that should be transformed.
y: Array-like: The true labels.

Returns

numpy.ndarray of shape (n_samples, n_features): The transformed data.

reset(self)

Resets the estimator to its initial state.

Returns

self

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
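
The <component>__<parameter> form can be pictured as splitting each name at the first double underscore and forwarding the remainder to the named component. The sketch below uses a hypothetical helper `split_param` and illustrative component names. It shows only the naming convention, not the estimator's own code.

```python
def split_param(name):
    """Split 'component__parameter' into (component, parameter).

    A name without '__' refers to the estimator itself, so the
    component is None.
    """
    head, sep, tail = name.partition('__')
    return (head, tail) if sep else (None, head)


print(split_param('window_size'))              # (None, 'window_size')
print(split_param('cleaner__strategy'))        # ('cleaner', 'strategy')
print(split_param('pipe__cleaner__strategy'))  # ('pipe', 'cleaner__strategy')
```

Because only the first double underscore is split off, deeper nesting is resolved one level at a time by each component in turn.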

Returns

self

transform(self, X)

Transforms the samples in X.

Parameters

X: numpy.ndarray of shape (n_samples, n_features): The sample or set of samples that should be transformed.