skmultiflow.data.SEAGenerator
SEA stream generator.
This generator is an implementation of the data stream with abrupt concept drift, first described in Street and Kim’s ‘A streaming ensemble algorithm (SEA) for large-scale classification’ [1].
It generates three numerical attributes, each ranging from 0 to 10, of which only the first two are relevant to the classification task. One of four classification functions is chosen. Each function compares the sum of the two relevant attributes with a threshold that is unique to that function, and the outcome of the comparison determines which of the two possible labels the instance receives.
Function 0: 0 if \((att1 + att2 \leq 8)\) else 1
Function 1: 0 if \((att1 + att2 \leq 9)\) else 1
Function 2: 0 if \((att1 + att2 \leq 7)\) else 1
Function 3: 0 if \((att1 + att2 \leq 9.5)\) else 1
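The four functions above differ only in their threshold, so they can be sketched as a single helper. This is a minimal illustration in plain Python, not the library's implementation; the names `SEA_THRESHOLDS` and `sea_label` are hypothetical.

```python
# Thresholds listed above, indexed by classification function (0 to 3).
SEA_THRESHOLDS = (8.0, 9.0, 7.0, 9.5)


def sea_label(att1, att2, classification_function=0):
    """Return 0 if att1 + att2 is at most the threshold, else 1.

    The third attribute is irrelevant to the label, so it is not an argument.
    """
    threshold = SEA_THRESHOLDS[classification_function]
    return 0 if att1 + att2 <= threshold else 1


# The same point (sum 8.5) changes label depending on the function used.
labels = [sea_label(4.0, 4.5, f) for f in range(4)]
print(labels)  # [1, 0, 1, 0]
```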
Concept drift can be introduced by changing the classification function. This can be done manually or using ConceptDriftStream.
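To make the idea of abrupt drift concrete, the following plain-Python sketch labels random points with one function and then switches to another at a fixed position. It does not use skmultiflow or ConceptDriftStream; `drifting_labels`, its parameters, and the chosen drift position are assumptions made for illustration only.

```python
import random

THRESHOLDS = (8.0, 9.0, 7.0, 9.5)  # thresholds for functions 0 to 3


def drifting_labels(n_samples, drift_position, before=2, after=3, seed=1):
    """Label random points with one function, then abruptly switch to another."""
    rng = random.Random(seed)
    out = []
    for i in range(n_samples):
        att1, att2 = rng.uniform(0, 10), rng.uniform(0, 10)
        # The active concept changes exactly at drift_position.
        function = before if i < drift_position else after
        label = 0 if att1 + att2 <= THRESHOLDS[function] else 1
        out.append((function, label))
    return out


stream = drifting_labels(n_samples=2000, drift_position=1000)
# Share of class 1 before and after the drift point.
share_before = sum(label for _, label in stream[:1000]) / 1000
share_after = sum(label for _, label in stream[1000:]) / 1000
print(share_before, share_after)
```

Because function 3 has a higher threshold than function 2, fewer points exceed it, so the share of class 1 should drop after the switch.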
This data stream has two additional parameters. The first balances the classes, so that the class distribution tends toward a uniform one. The second adds noise, which, with a given probability, flips the label chosen for an instance.
Parameters
classification_function: Which of the four classification functions to use for the generation. This value can vary from 0 to 3, and the corresponding thresholds are 8, 9, 7 and 9.5.
random_state: If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
balance_classes: Whether to balance classes or not. If balanced, the class distribution will converge to a uniform distribution.
noise_percentage: The probability that noise will happen in the generation, from 0.0 to 1.0. For each new sample, a random value is drawn, and if that value is lower than noise_percentage, the chosen label is switched.
References
[1] W. Nick Street and YongSeog Kim. 2001. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '01). ACM, New York, NY, USA, 377-382. DOI=http://dx.doi.org/10.1145/502512.502568
Examples
>>> # Imports
>>> from skmultiflow.data.sea_generator import SEAGenerator
>>> # Setting up the stream
>>> stream = SEAGenerator(classification_function = 2, random_state = 112,
...                       balance_classes = False, noise_percentage = 0.28)
>>> # Retrieving one sample
>>> stream.next_sample()
(array([[ 3.75057129, 6.4030462 , 9.50016579]]), array([ 0.]))
>>> # Retrieving 10 samples
>>> stream.next_sample(10)
(array([[ 7.76929659, 8.32745763, 0.5480574 ],
       [ 8.85351458, 7.22346511, 0.02556032],
       [ 3.43419851, 0.94759888, 3.94642589],
       [ 7.3670683 , 9.55806869, 8.20609371],
       [ 3.78544458, 7.84763615, 0.86231513],
       [ 1.6222602 , 2.90069726, 0.45008172],
       [ 7.36533216, 8.39211485, 7.09361615],
       [ 9.8566856 , 3.88003308, 5.03154482],
       [ 6.8373245 , 7.21957381, 2.14152091],
       [ 0.75216155, 6.10890702, 4.25630425]]), array([ 1., 1., 1., 1., 1., 0., 0., 1., 1., 1.]))
>>> # Generators will have infinite remaining instances, so it returns -1
>>> stream.n_remaining_samples()
-1
>>> stream.has_more_samples()
True
Methods
generate_drift(self)
Generate drift by switching the classification function randomly.
get_data_info(self)
Retrieves minimum information from the stream.
get_info(self)
Collects and returns the information about the configuration of the estimator.
get_params(self[, deep])
Get parameters for this estimator.
has_more_samples(self)
Checks if stream has more samples.
is_restartable(self)
Determine if the stream is restartable.
last_sample(self)
Retrieves last batch_size samples in the stream.
n_remaining_samples(self)
Returns the estimated number of remaining samples.
next_sample(self[, batch_size])
Returns next sample from the stream.
prepare_for_use()
Prepare the stream for use.
reset(self)
Resets the estimator to its initial state.
restart(self)
Restart the stream.
set_params(self, **params)
Set the parameters of this estimator.
Attributes
balance_classes
Retrieve the value of the option: Balance classes.
classification_function
Retrieve the index of the current classification function.
feature_names
Retrieve the names of the features.
n_cat_features
Retrieve the number of categorical (integer) features.
n_features
Retrieve the number of features.
n_num_features
Retrieve the number of numerical features.
n_targets
Retrieve the number of targets.
noise_percentage
Retrieve the value of the option: Noise percentage.
target_names
Retrieve the names of the targets.
target_values
Retrieve all target_values in the stream for each target.
balance_classes: True if the classes are balanced.
classification_function: Index of the classification function, one of [0, 1, 2, 3].
feature_names: Names of the features.
get_data_info: Used by evaluator methods to identify the stream. The default format is: ‘Stream name - n_targets, n_classes, n_features’. Returns the stream data information.
get_info: Returns the configuration of the estimator.
get_params: If deep is True, returns the parameters for this estimator and contained subobjects that are estimators. Returns parameter names mapped to their values.
has_more_samples: Returns True if the stream has more samples.
is_restartable: Returns True if the stream is restartable.
last_sample: Returns a numpy.ndarray of shape (batch_size, n_features) and an array-like of shape (batch_size, n_targets), representing the last batch_size samples.
n_cat_features: The number of categorical (integer) features in the stream.
n_features: The total number of features.
n_num_features: The number of numerical features in the stream.
n_remaining_samples: Remaining number of samples, or -1 if infinite (e.g. for a generator).
n_targets: The number of targets in the stream.
next_sample: The sample generation works as follows. The three attributes are generated with the random generator, initialized with the seed passed by the user. The classification function then decides, based on the two relevant attributes, whether to classify the instance as class 0 or class 1. Next, if the classes should be balanced, they are balanced. Finally, noise is added if the noise percentage is higher than 0.0.
The generated sample has 3 features, of which only the first two are relevant, and 1 label (there is one classification task).
batch_size: The number of samples to return.
Returns a tuple with the features matrix and the labels matrix for the batch_size samples that were requested.
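The generation steps described for next_sample can be sketched in plain Python as follows. This is a hedged illustration, not the skmultiflow source: the class `SeaSketch` is hypothetical, it returns one sample at a time rather than a batch, and balancing by alternating the wanted class is an assumption of this sketch.

```python
import random

THRESHOLDS = (8.0, 9.0, 7.0, 9.5)  # thresholds for functions 0 to 3


class SeaSketch:
    """Hypothetical stand-in for the generator, for illustration only."""

    def __init__(self, classification_function=0, balance_classes=False,
                 noise_percentage=0.0, seed=None):
        self.threshold = THRESHOLDS[classification_function]
        self.balance_classes = balance_classes
        self.noise_percentage = noise_percentage
        self.rng = random.Random(seed)
        self.next_class = 0  # class wanted next when balancing

    def next_sample(self):
        while True:
            # Step 1: draw three attributes uniformly from [0, 10].
            atts = [self.rng.uniform(0, 10) for _ in range(3)]
            # Step 2: only the first two attributes decide the label.
            label = 0 if atts[0] + atts[1] <= self.threshold else 1
            # Step 3: when balancing, redraw until the wanted class appears
            # (alternating classes is an assumption of this sketch).
            if not self.balance_classes or label == self.next_class:
                break
        if self.balance_classes:
            self.next_class = 1 - self.next_class
        # Step 4: flip the label with probability noise_percentage.
        if self.rng.random() < self.noise_percentage:
            label = 1 - label
        return atts, label


gen = SeaSketch(classification_function=2, balance_classes=True, seed=112)
labels = [gen.next_sample()[1] for _ in range(10)]
print(labels)  # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
```

With balancing on and no noise, the sketch alternates strictly between the two classes, which is one simple way to make the class distribution uniform.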
noise_percentage: The probability of noise.
prepare_for_use: Deprecated in v0.5.0 and will be removed in v0.7.0.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
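The <component>__<parameter> naming convention can be illustrated with a small helper that splits such a name at the first double underscore. This is only a sketch of the convention; `split_nested_param` is a hypothetical name and is not part of skmultiflow.

```python
def split_nested_param(name):
    """Split 'component__parameter' into (component, parameter).

    A name without a double underscore targets the estimator itself,
    so the component is returned as None.
    """
    component, sep, parameter = name.partition("__")
    return (component, parameter) if sep else (None, name)


print(split_nested_param("clf__max_depth"))    # ('clf', 'max_depth')
print(split_nested_param("noise_percentage"))  # (None, 'noise_percentage')
```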
target_names: The names of the targets in the stream.
target_values: List of lists of all target_values for each target.