skmultiflow.meta.AccuracyWeightedEnsembleClassifier
Accuracy Weighted Ensemble classifier
Maximum number of estimators to be kept in the ensemble
Each member of the ensemble is an instance of the base estimator.
The size of one chunk to be processed (warning: the chunk size is not always the same as the batch size)
Number of folds to run cross-validation for computing the weight of a classifier in the ensemble
Notes
An Accuracy Weighted Ensemble (AWE) [1] is an ensemble of classification models in which each model is judiciously weighted based on its expected classification accuracy on the test data under a time-evolving environment. The ensemble is guaranteed to be efficient and robust against concept-drifting streams.
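The weighting scheme from [1], which the compute_baseline, compute_score and compute_weight methods below implement, can be sketched in plain Python. This is a minimal illustration and not the library's implementation; function and variable names here are my own. Each member's weight is w_i = MSE_r - MSE_i, where MSE_i is the member's mean square error on the most recent chunk and MSE_r is the error of a classifier predicting at random according to the chunk's class distribution.

```python
from collections import Counter

def mse_random(y):
    # Baseline MSE_r of a classifier that predicts at random according
    # to the class distribution p(c): MSE_r = sum_c p(c) * (1 - p(c))^2
    n = len(y)
    return sum((c / n) * (1 - c / n) ** 2 for c in Counter(y).values())

def mse_model(true_class_probas):
    # MSE_i of ensemble member i, computed from the probability it
    # assigned to the true label of each sample in the chunk.
    return sum((1 - p) ** 2 for p in true_class_probas) / len(true_class_probas)

def weight(mse_r, mse_i):
    # w_i = MSE_r - MSE_i: a member that is no more accurate than
    # random guessing receives a non-positive weight.
    return mse_r - mse_i
```

A member whose error is no better than the random baseline thus gets a non-positive weight and contributes nothing to the vote.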
References
Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. 2003. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '03). ACM, New York, NY, USA, 226-235.
Examples
>>> # Imports
>>> from skmultiflow.data import SEAGenerator
>>> from skmultiflow.meta import AccuracyWeightedEnsembleClassifier
>>>
>>> # Setting up a data stream
>>> stream = SEAGenerator(random_state=1)
>>>
>>> # Setup Accuracy Weighted Ensemble Classifier
>>> awe = AccuracyWeightedEnsembleClassifier()
>>>
>>> # Setup variables to control loop and track performance
>>> n_samples = 0
>>> correct_cnt = 0
>>> max_samples = 200
>>>
>>> # Train the classifier with the samples provided by the data stream
>>> while n_samples < max_samples and stream.has_more_samples():
>>>     X, y = stream.next_sample()
>>>     y_pred = awe.predict(X)
>>>     if y[0] == y_pred[0]:
>>>         correct_cnt += 1
>>>     awe.partial_fit(X, y)
>>>     n_samples += 1
>>>
>>> # Display results
>>> print('{} samples analyzed.'.format(n_samples))
>>> print('Accuracy Weighted Ensemble accuracy: {}'.format(correct_cnt / n_samples))
Methods
compute_baseline(y)
This method computes the score produced by a random classifier, serving as a baseline.
compute_score(model, X, y)
Computes the mean square error of a classifier via its predicted probabilities.
compute_score_crossvalidation(self, model, …)
Computes the score of interest, with or without cross-validation.
compute_weight(self, model, baseline_score)
Computes the weight of a classifier given the baseline score calculated on a random learner.
do_instance_pruning(self)
fit(self, X, y[, classes, sample_weight])
Fit the model.
get_info(self)
Collects and returns information about the configuration of the estimator.
get_params(self[, deep])
Get parameters for this estimator.
partial_fit(self, X[, y, classes, sample_weight])
Partially (incrementally) fit the model.
predict(self, X)
Predicts the labels of X in a general classification setting.
predict_proba(self, X)
Estimates the probability of each sample in X belonging to each of the class-labels.
reset(self)
Resets all parameters to their default values.
score(self, X, y[, sample_weight])
Returns the mean accuracy on the given test data and labels.
set_params(self, **params)
Set the parameters of this estimator.
train_model(model, X, y[, classes, …])
Trains a model, handling the fact that the base estimator may implement either fit or partial_fit.
WeightedClassifier
A wrapper that includes a base estimator and its associated weight (and additional information)
The base estimator to be wrapped with additional information. This estimator must have already been trained on a data chunk.
The weight associated with this estimator
The array containing the unique class labels of the data chunk this estimator is trained on.
This method computes the score produced by a random classifier, serving as a baseline. The baseline score is MSEr for a normal classifier, or br for a cost-sensitive classifier.
The labels of the chunk
The baseline score of a random learner
This code needs to take into account the fact that a classifier C trained on a previous data chunk may not have seen all the labels that appear in a new chunk (e.g. C is trained with only labels [1, 2] but the new chunk contains labels [1, 2, 3, 4, 5]).
The estimator in the ensemble to compute the score on
The data from the new chunk
The labels from the new chunk
The mean square error of the model (MSE_i)
The number of CV folds. If None, the score is computed directly on the entire data chunk. Otherwise, we proceed as in a traditional cross-validation setting.
The score of an estimator computed via CV
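The cross-validated variant described above can be sketched as follows. This is a toy illustration under an assumed fit/score interface, not the library's code: the chunk is split into contiguous folds, a fresh model is trained on the other folds, and the held-out fold's scores are averaged.

```python
def crossval_score(make_model, X, y, n_folds, score_fn):
    # Split the chunk into n_folds contiguous folds; train a fresh
    # model on the remaining folds, score it on the held-out fold,
    # then average the per-fold scores.
    n = len(X)
    fold_size = n // n_folds
    scores = []
    for k in range(n_folds):
        lo = k * fold_size
        hi = (k + 1) * fold_size if k < n_folds - 1 else n
        model = make_model()
        model.fit(X[:lo] + X[hi:], y[:lo] + y[hi:])
        scores.append(score_fn(model, X[lo:hi], y[lo:hi]))
    return sum(scores) / n_folds
```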
Computes the weight of a classifier given the baseline score calculated on a random learner. The weight is based on either (1) the MSE for a normal classifier, or (2) the benefit for a cost-sensitive classifier.
The learner to compute the weight on
The baseline score calculated on a random learner
The number of CV folds. If not None, the weight is computed using cross-validation.
The weight computed from the MSE score of the classifier
The features to train the model.
An array-like with the class labels of all samples in X.
Contains all possible/known class labels. Usage varies depending on the learning method.
Samples weight. If not provided, uniform weights are assumed. Usage varies depending on the learning method.
Configuration of the estimator.
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Parameter names mapped to their values.
Updates the ensemble when a new data chunk arrives (Algorithm 1 in the paper). The update is only triggered once the chunk is full.
Contains the class values in the stream. If defined, will be used to define the length of the arrays returned by predict_proba
Samples weight. If not provided, uniform weights are assumed.
The prediction is made via normalized weighted voting: the class receiving the highest weighted vote is chosen.
Samples for which we want to predict the labels.
Predicted labels for all instances in X.
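Assuming each ensemble member exposes class probabilities over a shared class ordering, the weighted vote can be sketched as follows (an illustration only, not the library's implementation; note that normalizing by the total weight does not change the argmax):

```python
def weighted_vote(probas, weights):
    # probas: one probability vector per ensemble member, all over
    # the same class ordering; weights: the members' AWE weights.
    # Returns the index of the class with the largest weighted vote.
    total = [0.0] * len(probas[0])
    for proba, w in zip(probas, weights):
        for c, p in enumerate(proba):
            total[c] += w * p
    return max(range(len(total)), key=total.__getitem__)
```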
The matrix of samples one wants to predict the class probabilities for.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
Test samples.
True labels for X.
Sample weights.
Mean accuracy of self.predict(X) with respect to y.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
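The double-underscore routing convention can be illustrated with a toy sketch (this is not scikit-multiflow's actual implementation): a key like base_estimator__max_depth is split at the first __ and dispatched recursively to the named sub-object.

```python
def set_params(obj, **params):
    # Toy illustration of the '<component>__<parameter>' convention:
    # 'base_estimator__max_depth' routes 'max_depth' to
    # obj.base_estimator; a plain key is set on obj itself.
    for key, value in params.items():
        name, _, rest = key.partition('__')
        if rest:
            set_params(getattr(obj, name), **{rest: value})
        else:
            setattr(obj, name, value)
    return obj
```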
The model to train
The data chunk
The labels in the chunk
The unique classes in the data chunk
Instance weight. If not provided, uniform weights are assumed.
The trained model