skmultiflow.meta.AccuracyWeightedEnsembleClassifier

class skmultiflow.meta.AccuracyWeightedEnsembleClassifier(n_estimators=10, n_kept_estimators=30, base_estimator=NaiveBayes(nominal_attributes=None), window_size=200, n_splits=5)[source]

Accuracy Weighted Ensemble classifier

Parameters
n_estimators: int (default=10)

Maximum number of estimators to be kept in the ensemble

base_estimator: skmultiflow.core.BaseSKMObject or sklearn.BaseEstimator (default=NaiveBayes)

Each member of the ensemble is an instance of the base estimator.

window_size: int (default=200)

The size of one chunk to be processed (warning: the chunk size is not always the same as the batch size)

n_splits: int (default=5)

Number of folds to run cross-validation for computing the weight of a classifier in the ensemble

Notes

An Accuracy Weighted Ensemble (AWE) [1] is an ensemble of classification models in which each model is weighted by its expected classification accuracy on the test data in a time-evolving environment. The ensemble is designed to be efficient and robust against concept drift in data streams.
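
For reference, the weighting scheme described in [1] can be written as follows. Here S_n is the most recent chunk, f_c^i(x) is the probability that classifier C_i assigns to the true class c of sample x, and p(c) is the proportion of class c in S_n. The symbols follow the paper and are not necessarily the names used in this implementation.

```latex
\begin{align}
\mathrm{MSE}_i &= \frac{1}{|S_n|} \sum_{(x,c) \in S_n} \bigl(1 - f_c^i(x)\bigr)^2 \\
\mathrm{MSE}_r &= \sum_{c} p(c)\,\bigl(1 - p(c)\bigr)^2 \\
w_i &= \mathrm{MSE}_r - \mathrm{MSE}_i
\end{align}
```

For example, with two equally likely classes, MSE_r = 0.5 * 0.25 + 0.5 * 0.25 = 0.25, so a classifier receives a positive weight only if its MSE on the chunk is below 0.25.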

References

[1] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. 2003. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '03). ACM, New York, NY, USA, 226-235.

Examples

>>> # Imports
>>> from skmultiflow.data import SEAGenerator
>>> from skmultiflow.meta import AccuracyWeightedEnsembleClassifier
>>>
>>> # Setting up a data stream
>>> stream = SEAGenerator(random_state=1)
>>>
>>> # Setup Accuracy Weighted Ensemble Classifier
>>> awe = AccuracyWeightedEnsembleClassifier()
>>>
>>> # Setup variables to control loop and track performance
>>> n_samples = 0
>>> correct_cnt = 0
>>> max_samples = 200
>>>
>>> # Train the classifier with the samples provided by the data stream
>>> while n_samples < max_samples and stream.has_more_samples():
...     X, y = stream.next_sample()
...     y_pred = awe.predict(X)
...     if y[0] == y_pred[0]:
...         correct_cnt += 1
...     awe.partial_fit(X, y)
...     n_samples += 1
...
>>> # Display results
>>> print('{} samples analyzed.'.format(n_samples))
>>> print('Accuracy Weighted Ensemble accuracy: {}'.format(correct_cnt / n_samples))

Methods

compute_baseline(y)

Computes the score produced by a random classifier, which serves as a baseline.

compute_score(model, X, y)

Computes the mean square error of a classifier, via the predicted probabilities.

compute_score_crossvalidation(self, model, …)

Computes the score of interest, with or without cross-validation.

compute_weight(self, model, baseline_score)

Computes the weight of a classifier given the baseline score calculated on a random learner.

do_instance_pruning(self)

fit(self, X, y[, classes, sample_weight])

Fit the model.

get_info(self)

Collects and returns the information about the configuration of the estimator

get_params(self[, deep])

Get parameters for this estimator.

partial_fit(self, X[, y, classes, sample_weight])

Partially (incrementally) fit the model.

predict(self, X)

Predicts the labels of X in a general classification setting.

predict_proba(self, X)

Estimates the probability of each sample in X belonging to each of the class-labels.

reset(self)

Resets all parameters to their default values.

score(self, X, y[, sample_weight])

Returns the mean accuracy on the given test data and labels.

set_params(self, **params)

Set the parameters of this estimator.

train_model(model, X, y[, classes, …])

Trains a model, handling the case where it implements only one of fit or partial_fit.

class WeightedClassifier(estimator, weight, seen_labels)[source]

A wrapper that includes a base estimator and its associated weight (and additional information)

Parameters
estimator: StreamModel or sklearn.BaseEstimator

The base estimator to be wrapped up with additional information. This estimator must already have been trained on a data chunk.

weight: float

The weight associated to this estimator

seen_labels: array

The array containing the unique class labels of the data chunk this estimator is trained on.

static compute_baseline(y)[source]

Computes the score produced by a random classifier, which serves as a baseline. The baseline score is MSE_r for a normal classifier and b_r for a cost-sensitive classifier.

Parameters
y: numpy.array

The labels of the chunk

Returns
float

The baseline score of a random learner
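
The baseline for a normal classifier can be illustrated with a minimal pure-Python sketch. It assumes the MSE_r definition from [1], with class proportions estimated from the chunk labels. The function name `baseline_mse` is hypothetical and is not part of the library.

```python
from collections import Counter


def baseline_mse(labels):
    """MSE_r of a learner that predicts at random from the class distribution."""
    counts = Counter(labels)
    n = len(labels)
    # Sum over classes of p(c) * (1 - p(c))^2, with p(c) estimated from the chunk.
    return sum((k / n) * (1.0 - k / n) ** 2 for k in counts.values())


balanced = baseline_mse([0, 1, 0, 1])  # two equally likely classes
skewed = baseline_mse([0, 0, 0, 1])    # class 0 makes up 75% of the chunk
print(balanced, skewed)
```

A skewed chunk yields a lower baseline than a balanced one, so a classifier has to beat a stricter threshold to earn a positive weight.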

static compute_score(model, X, y)[source]

Computes the mean square error of a classifier, via the predicted probabilities.

This code needs to take into account the fact that a classifier C trained on a previous data chunk may not have seen all the labels that appear in a new chunk (e.g. C is trained with only labels [1, 2] but the new chunk contains labels [1, 2, 3, 4, 5]).

Parameters
model: StreamModel or sklearn.BaseEstimator

The estimator in the ensemble to compute the score on

X: numpy.ndarray of shape (window_size, n_features)

The data from the new chunk

y: numpy.array

The labels from the new chunk

Returns
float

The mean square error of the model (MSE_i)
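
The following sketch shows one way to compute MSE_i while handling labels the model has never seen. It assumes that an unseen label is treated as having predicted probability 0, which contributes the maximum error of 1 for that sample. The function `mean_square_error` and its arguments are hypothetical names, not the library's API.

```python
def mean_square_error(probas, seen_labels, y):
    """Mean of (1 - f_c(x))^2 over the chunk, where f_c(x) is the
    probability the model assigns to the true class c of sample x."""
    total = 0.0
    for row, true_label in zip(probas, y):
        if true_label in seen_labels:
            # Column order of `row` is assumed to match `seen_labels`.
            p_true = row[seen_labels.index(true_label)]
        else:
            # Assumption: a label the model never saw gets probability 0.
            p_true = 0.0
        total += (1.0 - p_true) ** 2
    return total / len(y)


# Hypothetical model trained on labels [1, 2]; the new chunk also contains 3.
probas = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
mse = mean_square_error(probas, seen_labels=[1, 2], y=[1, 2, 3])
print(mse)
```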

compute_score_crossvalidation(self, model, n_splits)[source]

Computes the score of interest, with or without cross-validation.

Parameters
model: StreamModel or sklearn.BaseEstimator

The estimator in the ensemble to compute the score on

n_splits: int

The number of CV folds. If None, the score is computed directly on the entire data chunk. Otherwise, standard k-fold cross-validation is used.

Returns
float

The score of an estimator computed via CV
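
The two modes can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: folds are interleaved for brevity, and `score_with_optional_cv`, `train_prior` and `mse_of_prior` are hypothetical helpers. The toy "model" is just the fraction of samples labelled 1.

```python
def score_with_optional_cv(model, X, y, n_splits, train, score):
    """Score `model` on the whole chunk, or average the scores of
    freshly trained fold models when n_splits is given."""
    if n_splits is None:
        return score(model, X, y)
    fold_scores = []
    for k in range(n_splits):
        # Interleaved folds: sample i belongs to test fold i % n_splits.
        test_idx = [i for i in range(len(y)) if i % n_splits == k]
        train_idx = [i for i in range(len(y)) if i % n_splits != k]
        fold_model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        fold_scores.append(
            score(fold_model, [X[i] for i in test_idx], [y[i] for i in test_idx])
        )
    return sum(fold_scores) / n_splits


def train_prior(X, y):
    """Toy model: the fraction of samples labelled 1."""
    return sum(y) / len(y)


def mse_of_prior(p1, X, y):
    """MSE of the toy model, using the probability given to the true label."""
    return sum((1.0 - (p1 if label == 1 else 1.0 - p1)) ** 2 for label in y) / len(y)


X = [[0.0]] * 6
y = [1, 1, 1, 0, 1, 1]
direct = score_with_optional_cv(train_prior(X, y), X, y, None, train_prior, mse_of_prior)
cv = score_with_optional_cv(None, X, y, 3, train_prior, mse_of_prior)
print(direct, cv)
```

In this toy case the cross-validated error is higher than the error measured on the training chunk itself, which is the usual reason for scoring a newly trained classifier with held-out folds.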

compute_weight(self, model, baseline_score, n_splits=None)[source]

Computes the weight of a classifier given the baseline score calculated on a random learner. The weight relies on either (1) MSE if it is a normal classifier, or (2) benefit if it is a cost-sensitive classifier.

Parameters
model: StreamModel or sklearn.BaseEstimator

The learner to compute the weight on

baseline_score: float

The baseline score calculated on a random learner

n_splits: int (default=None)

The number of CV folds. If not None (and a number), the weight is computed using cross-validation.

Returns
float

The weight computed from the MSE score of the classifier
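
For a normal classifier, the weight in [1] is the margin by which the model beats the random baseline. The sketch below additionally clips negative margins to zero; that clipping is an assumption made here for illustration, and `weight_from_mse` is a hypothetical name.

```python
def weight_from_mse(mse_model, mse_baseline):
    """Weight = baseline MSE minus model MSE."""
    # Assumption: a model that is no better than random gets weight 0.
    return max(0.0, mse_baseline - mse_model)


good = weight_from_mse(0.10, 0.25)  # beats the baseline
poor = weight_from_mse(0.40, 0.25)  # worse than random guessing
print(good, poor)
```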

fit(self, X, y, classes=None, sample_weight=None)[source]

Fit the model.

Parameters
X: numpy.ndarray of shape (n_samples, n_features)

The features to train the model.

y: numpy.ndarray of shape (n_samples, n_targets)

An array-like with the class labels of all samples in X.

classes: numpy.ndarray, optional (default=None)

Contains all possible/known class labels. Usage varies depending on the learning method.

sample_weight: numpy.ndarray, optional (default=None)

Samples weight. If not provided, uniform weights are assumed. Usage varies depending on the learning method.

Returns
self

get_info(self)[source]

Collects and returns the information about the configuration of the estimator

Returns
string

Configuration of the estimator.

get_params(self, deep=True)[source]

Get parameters for this estimator.

Parameters
deep: boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: mapping of string to any

Parameter names mapped to their values.

partial_fit(self, X, y=None, classes=None, sample_weight=None)[source]

Partially (incrementally) fit the model.

Updates the ensemble when a new data chunk arrives (Algorithm 1 in the paper). The update is only launched when the chunk is filled up.

Parameters
X: numpy.ndarray of shape (n_samples, n_features)

The features to train the model.

y: numpy.ndarray of shape (n_samples)

An array-like with the class labels of all samples in X.

classes: numpy.ndarray, optional (default=None)

Contains the class values in the stream. If defined, will be used to define the length of the arrays returned by predict_proba

sample_weight: float or array-like

Samples weight. If not provided, uniform weights are assumed.
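
Because the update only runs once a chunk is full, the batch size passed to partial_fit and the chunk size (window_size) are independent. The sketch below illustrates that buffering behaviour with a hypothetical `ChunkBuffer` class; it is not the library's internal data structure.

```python
class ChunkBuffer:
    """Collects samples and hands back a chunk whenever one fills up."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.X, self.y = [], []

    def add(self, X_batch, y_batch):
        """Add a batch and return the list of chunks it completed."""
        chunks = []
        for x, label in zip(X_batch, y_batch):
            self.X.append(x)
            self.y.append(label)
            if len(self.y) == self.window_size:
                chunks.append((self.X, self.y))
                self.X, self.y = [], []  # start a new chunk
        return chunks


buffer = ChunkBuffer(window_size=4)
completed = []
for start in range(0, 10, 3):  # batches of up to 3 samples
    batch = list(range(start, min(start + 3, 10)))
    completed += buffer.add([[v] for v in batch], batch)

print(len(completed), buffer.y)
```

Ten samples delivered in batches of three complete two chunks of four, and the remaining two samples wait in the buffer until more data arrives.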

predict(self, X)[source]

Predicts the labels of X in a general classification setting.

The prediction is done via normalized weighted voting (choosing the maximum).

Parameters
X: numpy.ndarray of shape (n_samples, n_features)

Samples for which we want to predict the labels.

Returns
numpy.array

Predicted labels for all instances in X.
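
Normalized weighted voting can be sketched as follows. The example assumes every ensemble member reports probabilities over the same ordered list of classes and that the weights sum to a positive number. `weighted_vote` is a hypothetical helper.

```python
def weighted_vote(member_probas, weights, classes):
    """Combine per-member class probabilities using normalized weights
    and return the winning class with the combined scores."""
    total = sum(weights)
    combined = [0.0] * len(classes)
    for probas, w in zip(member_probas, weights):
        for j, p in enumerate(probas):
            combined[j] += (w / total) * p
    best = max(range(len(classes)), key=combined.__getitem__)
    return classes[best], combined


# Three members; the first carries twice the weight of each of the others.
member_probas = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
label, combined = weighted_vote(member_probas, [0.2, 0.1, 0.1], ["a", "b"])
print(label, combined)
```

Here two of the three members favour class "b", but the heavier and more confident first member tips the combined score towards "a".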

predict_proba(self, X)[source]

Estimates the probability of each sample in X belonging to each of the class-labels.

Parameters
X: numpy.ndarray of shape (n_samples, n_features)

The matrix of samples one wants to predict the class probabilities for.

Returns
numpy.ndarray of shape (n_samples, n_labels)

Each row is associated with the X entry of the same index. Row i contains len(self.target_values) elements, each of which is the probability that the i-th sample of X belongs to the corresponding class label.

reset(self)[source]

Resets all parameters to their default values.

score(self, X, y, sample_weight=None)[source]

Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
X: array-like, shape = (n_samples, n_features)

Test samples.

y: array-like, shape = (n_samples) or (n_samples, n_outputs)

True labels for X.

sample_weight: array-like, shape = [n_samples], optional

Sample weights.

Returns
score: float

Mean accuracy of self.predict(X) with respect to y.

set_params(self, **params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns
self

static train_model(model, X, y, classes=None, sample_weight=None)[source]

Trains a model, handling the case where it implements only one of fit or partial_fit.

Parameters
model: StreamModel or sklearn.BaseEstimator

The model to train

X: numpy.ndarray of shape (n_samples, n_features)

The data chunk

y: numpy.array of shape (n_samples)

The labels in the chunk

classes: list or numpy.array

The unique classes in the data chunk

sample_weight: float or array-like

Instance weight. If not provided, uniform weights are assumed.

Returns
StreamModel or sklearn.BaseEstimator

The trained model
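
One way to support both kinds of estimator is to try incremental training first and fall back to batch training. The sketch below assumes that an estimator without incremental support either lacks partial_fit or raises NotImplementedError from it. The `train` function and both toy classes are hypothetical.

```python
class Incremental:
    """Toy estimator that supports incremental training."""

    def partial_fit(self, X, y, classes=None):
        self.trained_with = "partial_fit"
        return self


class BatchOnly:
    """Toy estimator that only supports batch training."""

    def partial_fit(self, X, y, classes=None):
        raise NotImplementedError

    def fit(self, X, y, classes=None):
        self.trained_with = "fit"
        return self


def train(model, X, y, classes=None):
    """Use partial_fit when it works, otherwise fall back to fit."""
    partial_fit = getattr(model, "partial_fit", None)
    if partial_fit is not None:
        try:
            partial_fit(X, y, classes)
            return model
        except NotImplementedError:
            pass  # fall through to batch training
    model.fit(X, y, classes)
    return model


incremental = train(Incremental(), [[0.0]], [0])
batch = train(BatchOnly(), [[0.0]], [0])
print(incremental.trained_with, batch.trained_with)
```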