skmultiflow.data.AGRAWALGenerator

class skmultiflow.data.AGRAWALGenerator(classification_function=0, random_state=None, balance_classes=False, perturbation=0.0)[source]

Agrawal stream generator.

The generator was introduced by Agrawal et al. in [1] and was a common source of data for early work on scaling up decision tree learners. It produces a stream containing nine features, six numeric and three categorical. Ten functions are defined for generating binary class labels from the features; presumably these determine whether a loan should be approved. The features and functions are listed in the original paper [1].

feature name    feature description     values
salary          the salary              uniformly distributed from 20k to 150k
commission      the commission          if (salary < 75k) then 0 else uniformly distributed from 10k to 75k
age             the age                 uniformly distributed from 20 to 80
elevel          the education level     uniformly chosen from 0 to 4
car             car maker               uniformly chosen from 1 to 20
zipcode         zip code of the town    uniformly chosen from 0 to 8
hvalue          value of the house      uniformly distributed from 50k x zipcode to 100k x zipcode
hyears          years house owned       uniformly distributed from 1 to 30
loan            total loan amount       uniformly distributed from 0 to 500k
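
The feature table can be read as a sampling recipe. The sketch below draws one record with Python's standard `random` module, following the table as written. It is an illustration, not the library's implementation: `sample_features` is a hypothetical helper, and treating age and hyears as integers is an assumption.

```python
import random


def sample_features(rng):
    """Draw one record following the feature table as written."""
    salary = rng.uniform(20_000, 150_000)
    # Commission rule copied from the table: zero below 75k, else uniform.
    commission = 0.0 if salary < 75_000 else rng.uniform(10_000, 75_000)
    zipcode = rng.randint(0, 8)  # randint includes both endpoints
    return {
        "salary": salary,
        "commission": commission,
        "age": rng.randint(20, 80),    # assumption: integer ages
        "elevel": rng.randint(0, 4),
        "car": rng.randint(1, 20),
        "zipcode": zipcode,
        # House value scales with the zip code, so zipcode 0 gives 0.
        "hvalue": rng.uniform(50_000 * zipcode, 100_000 * zipcode),
        "hyears": rng.randint(1, 30),  # assumption: integer years
        "loan": rng.uniform(0, 500_000),
    }


record = sample_features(random.Random(42))
print(record)
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which mirrors the role of the `random_state` parameter.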

Parameters
classification_function: int (Default=0)

Which of the ten classification functions to use for the generation. The value can vary from 0 to 9.

random_state: int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

balance_classes: bool (Default: False)

Whether to balance classes or not. If balanced, the class distribution will converge to a uniform distribution.
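
One way to obtain a uniform class distribution is rejection sampling with alternating targets: keep drawing records until the label matches the class wanted next, then flip the wanted class. The sketch below assumes that approach; `label_by_age` is a hypothetical stand-in rule, not one of the ten classification functions, and the library may balance differently.

```python
import random


def label_by_age(age):
    """Hypothetical stand-in labelling rule (not one of the ten functions)."""
    return 0 if age < 35 else 1


def balanced_stream(n, rng):
    """Yield n (age, label) pairs whose labels alternate 0, 1, 0, 1, ..."""
    want_zero = True
    for _ in range(n):
        while True:
            age = rng.randint(20, 80)
            label = label_by_age(age)
            # Reject draws whose label is not the one wanted next.
            if (label == 0) == want_zero:
                break
        yield age, label
        want_zero = not want_zero


labels = [y for _, y in balanced_stream(1000, random.Random(1))]
print(labels.count(0), labels.count(1))  # 500 500
```

Without the rejection loop, this stand-in rule would label roughly a quarter of the records as class 0, since 15 of the 61 possible ages fall below 35.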

perturbation: float (Default: 0.0)

The probability that noise will happen in the generation. Each newly generated sample is perturbed by the amount of perturbation. Values go from 0.0 to 1.0.
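
The description leaves the exact noise model open. One plausible reading, stated here as an assumption rather than confirmed library behaviour, is that each numeric value is shifted by a random fraction of its range, scaled by perturbation, and then clipped back into range. `perturb_value`, `lo`, and `hi` are hypothetical names.

```python
import random


def perturb_value(value, lo, hi, perturbation, rng):
    """Shift value by up to perturbation * (hi - lo), then clip to [lo, hi]."""
    shift = (hi - lo) * (2.0 * rng.random() - 1.0) * perturbation
    return min(max(value + shift, lo), hi)


rng = random.Random(7)
noisy = perturb_value(40_000.0, 20_000.0, 150_000.0, perturbation=0.05, rng=rng)
print(noisy)  # stays within 6,500 of 40,000
```

With perturbation set to 0.0 the shift is zero and the value is returned unchanged.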

References

[1] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. “Database Mining: A Performance Perspective”, IEEE Transactions on Knowledge and Data Engineering, 5(6), December 1993.

Methods

generate_drift(self)

Generate drift by switching the classification function randomly.

get_data_info(self)

Retrieves minimum information from the stream

get_info(self)

Collects and returns the information about the configuration of the estimator

get_params(self[, deep])

Get parameters for this estimator.

has_more_samples(self)

Checks if stream has more samples.

is_restartable(self)

Determine if the stream is restartable.

last_sample(self)

Retrieves last batch_size samples in the stream.

n_remaining_samples(self)

Returns the estimated number of remaining samples.

next_sample(self[, batch_size])

Returns next sample from the stream.

prepare_for_use()

Prepare the stream for use.

reset(self)

Resets the estimator to its initial state.

restart(self)

Restart the stream.

set_params(self, **params)

Set the parameters of this estimator.

Attributes

balance_classes

Retrieve the value of the option: Balance classes

classification_function

Retrieve the index of the current classification function.

feature_names

Retrieve the names of the features.

n_cat_features

Retrieve the number of categorical features.

n_features

Retrieve the number of features.

n_num_features

Retrieve the number of numerical features.

n_targets

Retrieve the number of targets

perturbation

Retrieve the value of the option: Noise percentage

target_names

Retrieve the names of the targets

target_values

Retrieve all target_values in the stream for each target.

property balance_classes

Retrieve the value of the option: Balance classes

Returns
Boolean

True if the classes are balanced

property classification_function

Retrieve the index of the current classification function.

Returns
int

index of the classification function, from 0 to 9

property feature_names

Retrieve the names of the features.

Returns
list

names of the features

generate_drift(self)[source]

Generate drift by switching the classification function randomly.
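
Switching the classification function changes how labels depend on the features, which is what produces concept drift. The sketch below picks a new index at random; requiring the new index to differ from the current one is an assumption, and `pick_new_function` is a hypothetical helper.

```python
import random


def pick_new_function(current, rng, n_functions=10):
    """Return a random function index in [0, n_functions) other than current."""
    new = rng.randrange(n_functions)
    while new == current:
        new = rng.randrange(n_functions)
    return new


rng = random.Random(3)
current = 0
new = pick_new_function(current, rng)
print(current, "->", new)
```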

get_data_info(self)[source]

Retrieves minimum information from the stream

Used by evaluator methods to identify the stream.

The default format is: ‘Stream name - n_targets, n_classes, n_features’.

Returns
string

Stream data information
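
To make the default format concrete, the sketch below builds a string of that shape. The exact wording and the stream name used here are illustrative assumptions, and `format_data_info` is a hypothetical helper.

```python
def format_data_info(name, n_targets, n_classes, n_features):
    """Build a 'Stream name - n_targets, n_classes, n_features' style string."""
    return f"{name} - {n_targets} target(s), {n_classes} classes, {n_features} features"


print(format_data_info("AGRAWAL Generator", 1, 2, 9))
# AGRAWAL Generator - 1 target(s), 2 classes, 9 features
```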

get_info(self)[source]

Collects and returns the information about the configuration of the estimator

Returns
string

Configuration of the estimator.

get_params(self, deep=True)[source]

Get parameters for this estimator.

Parameters
deep: boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: mapping of string to any

Parameter names mapped to their values.

has_more_samples(self)[source]

Checks if stream has more samples.

Returns
Boolean

True if stream has more samples.

is_restartable(self)[source]

Determine if the stream is restartable.

Returns
Boolean

True if stream is restartable.

last_sample(self)[source]

Retrieves last batch_size samples in the stream.

Returns
tuple or tuple list

A numpy.ndarray of shape (batch_size, n_features) and an array-like of shape (batch_size, n_targets), representing the last batch_size samples.

property n_cat_features

Retrieve the number of categorical features.

Returns
int

The number of categorical features in the stream.

property n_features

Retrieve the number of features.

Returns
int

The total number of features.

property n_num_features

Retrieve the number of numerical features.

Returns
int

The number of numerical features in the stream.

n_remaining_samples(self)[source]

Returns the estimated number of remaining samples.

Returns
int

Remaining number of samples. -1 if infinite (e.g. generator)
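
Because a generator reports -1, a consumer cannot loop until the stream is exhausted and needs its own sample budget. The sketch below shows one way to handle the sentinel. `EndlessCounter` is a hypothetical stand-in that only mimics the three method names, not the library class, and `take` is a hypothetical helper.

```python
import itertools


class EndlessCounter:
    """Hypothetical stand-in for an unbounded stream (not the library class)."""

    def __init__(self):
        self._counter = itertools.count()

    def n_remaining_samples(self):
        return -1  # sentinel: the stream is unbounded

    def has_more_samples(self):
        return True

    def next_sample(self):
        return next(self._counter)


def take(stream, max_samples):
    """Collect at most max_samples items, stopping early if the stream ends."""
    remaining = stream.n_remaining_samples()
    budget = max_samples if remaining == -1 else min(max_samples, remaining)
    out = []
    while len(out) < budget and stream.has_more_samples():
        out.append(stream.next_sample())
    return out


print(take(EndlessCounter(), 5))  # [0, 1, 2, 3, 4]
```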

property n_targets

Retrieve the number of targets

Returns
int

the number of targets in the stream.

next_sample(self, batch_size=1)[source]

Returns next sample from the stream.

The sample generation works as follows: The 9 features are generated with the random generator, initialized with the seed passed by the user. Then, the classification function decides, as a function of all the attributes, whether to classify the instance as class 0 or class 1. The next step is to verify if the classes should be balanced, and if so, balance the classes. The last step is to add noise, if the noise percentage is higher than 0.0.

The generated sample will have 9 features and 1 label (it has one classification task).

Parameters
batch_size: int (optional, default=1)

The number of samples to return.

Returns
tuple or tuple list

Return a tuple with the features matrix and the labels matrix for the batch_size samples that were requested.
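
The order of the steps matters: the label is computed from the clean features, and noise is added only afterwards. The sketch below illustrates that order for a batch. It is a toy under stated assumptions: only age is drawn realistically, the age rule in `classify` is illustrative, class balancing is omitted for brevity, and all names are hypothetical.

```python
import random

N_FEATURES = 9
AGE = 2  # position of "age" in the feature table order


def draw_features(rng):
    """Toy sampler: nine values, only age is drawn realistically."""
    row = [rng.random() for _ in range(N_FEATURES)]
    row[AGE] = float(rng.randint(20, 80))
    return row


def classify(row):
    """Illustrative age rule (an assumption, see the lead-in)."""
    return 0 if row[AGE] < 40 or row[AGE] >= 60 else 1


def next_batch(batch_size, rng, perturbation=0.0):
    """Return (X, y) with batch_size rows of nine features and one label each."""
    X, y = [], []
    for _ in range(batch_size):
        row = draw_features(rng)
        label = classify(row)       # the label uses the clean features
        if perturbation > 0.0:      # noise is added last
            jitter = 60 * (2.0 * rng.random() - 1.0) * perturbation
            row[AGE] = min(max(row[AGE] + jitter, 20.0), 80.0)
        X.append(row)
        y.append(label)
    return X, y


X, y = next_batch(5, random.Random(0), perturbation=0.1)
print(len(X), len(X[0]), len(y))  # 5 9 5
```

Since noise is applied after labelling, a perturbed age near 40 or 60 can end up on the other side of the boundary from its label, which is how the noise makes the learning task harder.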

property perturbation

Retrieve the value of the option: Noise percentage

Returns
float

static prepare_for_use()[source]

Prepare the stream for use.

Deprecated in v0.5.0 and will be removed in v0.7.0

reset(self)[source]

Resets the estimator to its initial state.

Returns
self

restart(self)[source]

Restart the stream.

set_params(self, **params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
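
The `<component>__<parameter>` convention can be illustrated by splitting each key on the first double underscore. The sketch below only shows the naming scheme; `split_nested_params` is a hypothetical helper and `tree__max_depth` is a made-up nested parameter, not part of this class.

```python
def split_nested_params(params):
    """Group {'comp__param': v} entries by component; plain keys go under None."""
    grouped = {}
    for key, value in params.items():
        component, sep, sub_key = key.partition("__")
        if sep:
            grouped.setdefault(component, {})[sub_key] = value
        else:
            grouped.setdefault(None, {})[key] = value
    return grouped


print(split_nested_params({"perturbation": 0.1, "tree__max_depth": 3}))
# {None: {'perturbation': 0.1}, 'tree': {'max_depth': 3}}
```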

Returns
self

property target_names

Retrieve the names of the targets

Returns
list

the names of the targets in the stream.

property target_values

Retrieve all target_values in the stream for each target.

Returns
list

list of lists of all target_values for each target