skmultiflow.meta.AccuracyWeightedEnsembleClassifier

class skmultiflow.meta.AccuracyWeightedEnsembleClassifier(n_estimators=10, n_kept_estimators=30, base_estimator=NaiveBayes(nominal_attributes=None), window_size=200, n_splits=5)

Accuracy Weighted Ensemble classifier

Parameters

n_estimators: int (default=10): Maximum number of estimators to be kept in the ensemble
n_kept_estimators: int (default=30): Maximum number of trained estimators retained in memory, from which the ensemble members are selected.
base_estimator: skmultiflow.core.BaseSKMObject or sklearn.BaseEstimator (default=NaiveBayes): Each member of the ensemble is an instance of the base estimator.
window_size: int (default=200): The size of one chunk to be processed (warning: the chunk size is not always the same as the batch size)
n_splits: int (default=5): Number of folds to run cross-validation for computing the weight of a classifier in the ensemble

Notes

An Accuracy Weighted Ensemble (AWE) [1] is an ensemble of classification models in which each model is weighted by its expected classification accuracy on the test data in a time-evolving environment. According to [1], the ensemble is efficient and robust on concept-drifting streams.

References

1: Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. 2003. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '03). ACM, New York, NY, USA, 226-235.

Examples

>>> # Imports
>>> from skmultiflow.data import SEAGenerator
>>> from skmultiflow.meta import AccuracyWeightedEnsembleClassifier
>>>
>>> # Setting up a data stream
>>> stream = SEAGenerator(random_state=1)
>>>
>>> # Setup Accuracy Weighted Ensemble Classifier
>>> awe = AccuracyWeightedEnsembleClassifier()
>>>
>>> # Setup variables to control loop and track performance
>>> n_samples = 0
>>> correct_cnt = 0
>>> max_samples = 200
>>>
>>> # Train the classifier with the samples provided by the data stream
>>> while n_samples < max_samples and stream.has_more_samples():
...     X, y = stream.next_sample()
...     y_pred = awe.predict(X)
...     if y[0] == y_pred[0]:
...         correct_cnt += 1
...     awe.partial_fit(X, y)
...     n_samples += 1
>>>
>>> # Display results
>>> print('{} samples analyzed.'.format(n_samples))
>>> print('Accuracy Weighted Ensemble accuracy: {}'.format(correct_cnt / n_samples))

Methods

`compute_baseline`(y)	Computes the score produced by a random classifier, which serves as a baseline.
`compute_score`(model, X, y)	Computes the mean square error of a classifier, via the predicted probabilities.
`compute_score_crossvalidation`(self, model, …)	Computes the score of interest, optionally using cross-validation.
`compute_weight`(self, model, baseline_score)	Computes the weight of a classifier given the baseline score calculated on a random learner.
`do_instance_pruning`(self)
`fit`(self, X, y[, classes, sample_weight])	Fit the model.
`get_info`(self)	Collects and returns the information about the configuration of the estimator
`get_params`(self[, deep])	Get parameters for this estimator.
`partial_fit`(self, X[, y, classes, sample_weight])	Partially (incrementally) fit the model.
`predict`(self, X)	Predicts the labels of X in a general classification setting.
`predict_proba`(self, X)	Estimates the probability of each sample in X belonging to each of the class-labels.
`reset`(self)	Resets all parameters to their default values.
`score`(self, X, y[, sample_weight])	Returns the mean accuracy on the given test data and labels.
`set_params`(self, **params)	Set the parameters of this estimator.
`train_model`(model, X, y[, classes, …])	Trains a model, taking care of the fact that either fit or partial_fit is implemented

class WeightedClassifier(estimator, weight, seen_labels)

A wrapper that includes a base estimator and its associated weight (and additional information)

Parameters

estimator: StreamModel or sklearn.BaseEstimator: The base estimator to be wrapped with additional information. This estimator must already have been trained on a data chunk.
weight: float: The weight associated with this estimator.
seen_labels: array: The array containing the unique class labels of the data chunk this estimator is trained on.

static compute_baseline(y)

Computes the score produced by a random classifier, which serves as a baseline. The baseline score is MSE_r for a normal classifier and b_r for a cost-sensitive classifier.

Parameters

y: numpy.array: The labels of the chunk

Returns

float: The baseline score of a random learner
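
As a rough illustration of the normal-classifier case, the sketch below computes MSE_r under the definition given in [1]: a random classifier predicts class c with probability p(c), the frequency of c in the chunk, so MSE_r = sum over c of p(c) * (1 - p(c))^2. The function name `baseline_mse` is hypothetical and is not part of the library; the library's own implementation may differ in detail.

```python
from collections import Counter


def baseline_mse(y):
    """MSE_r of a learner that guesses according to the class frequencies in y.

    Hypothetical helper for illustration; assumes the MSE_r definition from [1].
    """
    counts = Counter(y)
    n = len(y)
    # Sum over classes of p(c) * (1 - p(c))^2, with p(c) the class frequency
    return sum((k / n) * (1.0 - k / n) ** 2 for k in counts.values())


print(baseline_mse([0, 1, 0, 1]))  # 0.25 for a balanced binary chunk
```

A chunk containing a single class yields a baseline of 0, since a random guesser following the class distribution is always right.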

static compute_score(model, X, y)

Computes the mean square error of a classifier, via the predicted probabilities.

This method takes into account that a classifier C trained on a previous data chunk may not have seen all the labels that appear in a new chunk (e.g. C is trained with only labels [1, 2] but the new chunk contains labels [1, 2, 3, 4, 5]).

Parameters

model: StreamModel or sklearn.BaseEstimator: The estimator in the ensemble to compute the score on
X: numpy.ndarray of shape (window_size, n_features): The data from the new chunk
y: numpy.array: The labels from the new chunk

Returns

float: The mean square error of the model (MSE_i)
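
The unseen-label case can be sketched as follows. This is a minimal illustration, assuming the MSE_i definition from [1] (the mean of (1 - p)^2, where p is the probability the model assigns to the true label) and assuming that a label the model never saw receives probability 0. The name `mse_score` and its arguments are hypothetical, not the library's API.

```python
def mse_score(probabilities, seen_labels, y):
    """Mean of (1 - p_true)^2 over the chunk.

    probabilities[i][j] is the predicted probability that sample i has
    label seen_labels[j]. Hypothetical helper for illustration only.
    """
    index_of = {label: i for i, label in enumerate(seen_labels)}
    total = 0.0
    for probs, label in zip(probabilities, y):
        # Assumption: a label the model never saw gets probability 0
        p_true = probs[index_of[label]] if label in index_of else 0.0
        total += (1.0 - p_true) ** 2
    return total / len(y)


# The model knows labels [1, 2]; the new chunk also contains label 3
probs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(mse_score(probs, seen_labels=[1, 2], y=[1, 2, 3]))
```

The third sample contributes the maximum error of 1 because its label lies outside the labels the model was trained on.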

compute_score_crossvalidation(self, model, n_splits)

Computes the score of interest, optionally using cross-validation.

Parameters

model: StreamModel or sklearn.BaseEstimator: The estimator in the ensemble to compute the score on
n_splits: int: The number of CV folds. If None, the score is computed directly on the entire data chunk. Otherwise, standard cross-validation is used.

Returns

float: The score of an estimator computed via CV

compute_weight(self, model, baseline_score, n_splits=None)

Computes the weight of a classifier given the baseline score calculated on a random learner. The weight relies on either (1) MSE if it is a normal classifier, or (2) benefit if it is a cost-sensitive classifier.

Parameters

model: StreamModel or sklearn.BaseEstimator: The learner to compute the weight on
baseline_score: float: The baseline score calculated on a random learner
n_splits: int (default=None): The number of CV folds. If an integer is given, the weight is computed using cross-validation.

Returns

float: The weight computed from the MSE score of the classifier
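
For the MSE case, [1] defines the weight of classifier i as w_i = MSE_r - MSE_i, so a classifier that does no better than random guessing earns no positive weight. The sketch below illustrates this; clipping negative values to zero is an assumption of the sketch, and the name `accuracy_weight` is hypothetical.

```python
def accuracy_weight(mse_i, mse_r):
    """Weight of a classifier from its MSE and the random-learner baseline.

    Hypothetical helper; clipping at zero is an assumption of this sketch.
    """
    return max(0.0, mse_r - mse_i)


print(accuracy_weight(mse_i=0.10, mse_r=0.25))  # better than random: positive weight
print(accuracy_weight(mse_i=0.40, mse_r=0.25))  # worse than random: weight clipped to 0.0
```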

fit(self, X, y, classes=None, sample_weight=None)

Fit the model.

Parameters

X: numpy.ndarray of shape (n_samples, n_features): The features to train the model.
y: numpy.ndarray of shape (n_samples, n_targets): An array-like with the class labels of all samples in X.
classes: numpy.ndarray, optional (default=None): Contains all possible/known class labels. Usage varies depending on the learning method.
sample_weight: numpy.ndarray, optional (default=None): Sample weights. If not provided, uniform weights are assumed. Usage varies depending on the learning method.

Returns

self

get_info(self)

Collects and returns the information about the configuration of the estimator.

Returns

string: Configuration of the estimator.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters

deep: boolean, optional: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: mapping of string to any: Parameter names mapped to their values.

partial_fit(self, X, y=None, classes=None, sample_weight=None)

Partially (incrementally) fit the model.

Updates the ensemble when a new data chunk arrives (Algorithm 1 in the paper). The update is only launched when the chunk is filled up.

Parameters

X: numpy.ndarray of shape (n_samples, n_features): The features to train the model.
y: numpy.ndarray of shape (n_samples): An array-like with the class labels of all samples in X.
classes: numpy.ndarray, optional (default=None): Contains the class values in the stream. If defined, it is used to define the length of the arrays returned by predict_proba.
sample_weight: float or array-like: Sample weights. If not provided, uniform weights are assumed.
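
The chunk-filling behaviour described above can be pictured with the small buffer below. It is only a sketch: the `ChunkBuffer` class is hypothetical and does not mirror the library's internal data structures. It shows that samples accumulate silently and that an ensemble update would be triggered only on the call that completes a chunk of `window_size` samples.

```python
class ChunkBuffer:
    """Collects samples and hands back a chunk once window_size samples arrived.

    Hypothetical helper for illustration; not part of skmultiflow.
    """

    def __init__(self, window_size):
        self.window_size = window_size
        self.X, self.y = [], []

    def add(self, x, label):
        """Store one sample; return the (X, y) chunk when full, else None."""
        self.X.append(x)
        self.y.append(label)
        if len(self.y) < self.window_size:
            return None
        chunk = (self.X, self.y)
        self.X, self.y = [], []  # start a fresh chunk
        return chunk


buffer = ChunkBuffer(window_size=3)
updates = [buffer.add([i], i % 2) is not None for i in range(7)]
print(updates)  # [False, False, True, False, False, True, False]
```

With the default window_size of 200, this means the first 199 calls with single samples only buffer data, so predictions made during that period come from an ensemble that has not yet been trained.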

predict(self, X)

Predicts the labels of X in a general classification setting.

The prediction is done via normalized weighted voting (choosing the maximum).

Parameters

X: numpy.ndarray of shape (n_samples, n_features): Samples for which we want to predict the labels.

Returns

numpy.array: Predicted labels for all instances in X.
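
Normalized weighted voting can be sketched as follows: each ensemble member votes for a label with its weight divided by the total weight, and the label with the largest share wins. The function `weighted_vote` is a hypothetical illustration for a single sample, not the library's implementation; returning None when all weights are zero is an assumption of the sketch.

```python
def weighted_vote(predictions, weights):
    """Return the label holding the largest share of the total weight.

    predictions[i] is the label chosen by ensemble member i and weights[i]
    its weight. Hypothetical helper for illustration only.
    """
    total = sum(weights)
    if total == 0:
        return None  # assumption: no usable votes when every weight is zero
    votes = {}
    for label, weight in zip(predictions, weights):
        votes[label] = votes.get(label, 0.0) + weight / total
    return max(votes, key=votes.get)


# Two of three members predict 1, but the single vote for 0 carries more weight
print(weighted_vote([0, 1, 1], [0.6, 0.25, 0.25]))  # 0
```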

predict_proba(self, X)

Estimates the probability of each sample in X belonging to each of the class-labels.

Parameters

X: numpy.ndarray of shape (n_samples, n_features): The matrix of samples one wants to predict the class probabilities for.

Returns

numpy.ndarray of shape (n_samples, n_labels): Row i corresponds to the i-th sample of X and contains len(self.target_values) entries, each giving the probability that the sample belongs to the corresponding class label.

reset(self)

Resets all parameters to their default values.

score(self, X, y, sample_weight=None)

Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters

X: array-like, shape = (n_samples, n_features): Test samples.
y: array-like, shape = (n_samples) or (n_samples, n_outputs): True labels for X.
sample_weight: array-like, shape = [n_samples], optional: Sample weights.

Returns

score: float: Mean accuracy of self.predict(X) with respect to y.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns

self

static train_model(model, X, y, classes=None, sample_weight=None)

Trains a model, taking care of the fact that either fit or partial_fit is implemented.

Parameters

model: StreamModel or sklearn.BaseEstimator: The model to train
X: numpy.ndarray of shape (n_samples, n_features): The data chunk
y: numpy.array of shape (n_samples): The labels in the chunk
classes: list or numpy.array: The unique classes in the data chunk
sample_weight: float or array-like: Instance weight. If not provided, uniform weights are assumed.

Returns

StreamModel or sklearn.BaseEstimator: The trained model
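
One way to picture the fit / partial_fit dispatch is the sketch below. It checks for a `partial_fit` attribute with `hasattr`, which is an assumption of this sketch; the library may detect the supported method differently. The function `train` and the two toy learner classes are hypothetical.

```python
def train(model, X, y, classes=None, sample_weight=None):
    """Train with partial_fit when the model offers it, otherwise use fit.

    Hypothetical helper; the hasattr check is an assumption of this sketch.
    """
    if hasattr(model, "partial_fit"):
        model.partial_fit(X, y, classes=classes, sample_weight=sample_weight)
    else:
        model.fit(X, y)  # batch-only learner
    return model


class BatchOnly:
    """Hypothetical learner exposing only fit."""

    def fit(self, X, y):
        self.trained_with = "fit"
        return self


class Incremental:
    """Hypothetical learner exposing partial_fit."""

    def partial_fit(self, X, y, classes=None, sample_weight=None):
        self.trained_with = "partial_fit"
        return self


print(train(BatchOnly(), [[0.0]], [0]).trained_with)    # fit
print(train(Incremental(), [[0.0]], [0]).trained_with)  # partial_fit
```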