skmultiflow.trees.HoeffdingAdaptiveTreeClassifier¶

class skmultiflow.trees.HoeffdingAdaptiveTreeClassifier(max_byte_size=33554432, memory_estimate_period=1000000, grace_period=200, split_criterion='info_gain', split_confidence=1e-07, tie_threshold=0.05, binary_split=False, stop_mem_management=False, remove_poor_atts=False, no_preprune=False, leaf_prediction='nba', nb_threshold=0, nominal_attributes=None, bootstrap_sampling=True, random_state=None)[source]¶

Hoeffding Adaptive Tree classifier.

Parameters

max_byte_size: int (default=33554432)

Maximum memory consumed by the tree.

memory_estimate_period: int (default=1000000)

Number of instances between memory consumption checks.

grace_period: int (default=200)

Number of instances a leaf should observe between split attempts.

split_criterion: string (default=’info_gain’)

Split criterion to use.

‘gini’ - Gini
‘info_gain’ - Information Gain

split_confidence: float (default=0.0000001)

Allowed error in split decision, a value closer to 0 takes longer to decide.

tie_threshold: float (default=0.05)

Threshold below which a split will be forced to break ties.

binary_split: boolean (default=False)

If True, only allow binary splits.

stop_mem_management: boolean (default=False)

If True, stop growing as soon as memory limit is hit.

remove_poor_atts: boolean (default=False)

If True, disable poor attributes.

no_preprune: boolean (default=False)

If True, disable pre-pruning.

leaf_prediction: string (default=’nba’)

Prediction mechanism used at leafs.

‘mc’ - Majority Class
‘nb’ - Naive Bayes
‘nba’ - Naive Bayes Adaptive

nb_threshold: int (default=0)

Number of instances a leaf should observe before allowing Naive Bayes.

nominal_attributes: list, optional

List of Nominal attributes. If emtpy, then assume that all attributes are numerical.

bootstrap_sampling: bool, optional (default=True)

If True, perform bootstrap sampling in the leaf nodes.

random_state: int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Only used when bootstrap_sampling=True to direct the bootstrap sampling.

Notes

The Hoeffding Adaptive Tree [1] uses ADWIN [2] to monitor performance of branches on the tree and to replace them with new branches when their accuracy decreases if the new branches are more accurate.

The bootstrap sampling strategy is an improvement over the original Hoeffding Adaptive Tree algorithm. It is enabled by default since, in general, it results in better performance.

References

1: Bifet, Albert, and Ricard Gavaldà. “Adaptive learning from evolving data streams.” In International Symposium on Intelligent Data Analysis, pp. 249-260. Springer, Berlin, Heidelberg, 2009.
2: Bifet, Albert, and Ricard Gavaldà. “Learning from time-changing data with adaptive windowing.” In Proceedings of the 2007 SIAM international conference on data mining, pp. 443-448. Society for Industrial and Applied Mathematics, 2007.

Examples

>>> from skmultiflow.trees import HoeffdingAdaptiveTreeClassifier
>>> from skmultiflow.data import ConceptDriftStream
>>> from skmultiflow.evaluation import EvaluatePrequential
>>> # Setup the File Stream
>>> stream = ConceptDriftStream(random_state=123456, position=25000)
>>>
>>> classifier = HoeffdingAdaptiveTreeClassifier()
>>> evaluator = EvaluatePrequential(pretrain_size=200, max_samples=50000, batch_size=1,
>>>                                 n_wait=200, max_time=1000, output_file=None,
>>>                                 show_plot=True, metrics=['kappa', 'kappa_t', 'accuracy'])
>>>
>>> evaluator.evaluate(stream=stream, model=classifier)

Methods

`compute_hoeffding_bound`(range_val, confidence, n)	Compute the Hoeffding bound, used to decide how many samples are necessary at each node.
`deactivate_all_leaves`(self)	Deactivate all leaves.
`enforce_tracker_limit`(self)	Track the size of the tree and disable/enable nodes if required.
`estimate_model_byte_size`(self)	Calculate the size of the model and trigger tracker function if the actual model size exceeds the max size in the configuration.
`filter_instance_to_leaves`(self, X, y, …)
`fit`(self, X, y[, classes, sample_weight])	Fit the model.
`get_info`(self)	Collects and returns the information about the configuration of the estimator
`get_model_description`(self)	Walk the tree and return its structure in a buffer.
`get_model_rules`(self)	Returns list of list describing the tree.
`get_params`(self[, deep])	Get parameters for this estimator.
`get_rules_description`(self)	Prints the the description of tree using rules.
`get_votes_for_instance`(self, X)	Get class votes for a single instance.
`measure_byte_size`(self)	Calculate the size of the tree.
`measure_tree_depth`(self)	Calculate the depth of the tree.
`new_split_node`(self, split_test, …)	Create a new split node.
`partial_fit`(self, X, y[, classes, sample_weight])	Incrementally trains the model.
`predict`(self, X)	Predicts the label of the X instance(s)
`predict_proba`(self, X)	Predicts probabilities of all label of the X instance(s)
`reset`(self)	Reset the Hoeffding Tree to default values.
`score`(self, X, y[, sample_weight])	Returns the mean accuracy on the given test data and labels.
`set_params`(self, **params)	Set the parameters of this estimator.

Attributes

`binary_split`
`classes`
`get_model_measurements`	Collect metrics corresponding to the current status of the tree.
`grace_period`
`leaf_prediction`
`max_byte_size`
`memory_estimate_period`
`nb_threshold`
`no_preprune`
`nominal_attributes`
`remove_poor_atts`
`split_confidence`
`split_criterion`
`stop_mem_management`
`tie_threshold`

static compute_hoeffding_bound(range_val, confidence, n)[source]¶

Compute the Hoeffding bound, used to decide how many samples are necessary at each node.

Parameters

range_val: float: Range value.
confidence: float: Confidence of choosing the correct attribute.
n: int or float: Number of samples.

Returns

float: The Hoeffding bound.

Notes

The Hoeffding bound is defined as:

\[\epsilon = \sqrt{\frac{R^2\ln(1/\delta))}{2n}}\]

where:

\(\epsilon\): Hoeffding bound.

\(R\): Range of a random variable. For a probability the range is 1, and for an information gain the range is log c, where c is the number of classes.

\(\delta\): Confidence. 1 minus the desired probability of choosing the correct attribute at any given node.

\(n\): Number of samples.

deactivate_all_leaves(self)[source]¶: Deactivate all leaves.

enforce_tracker_limit(self)[source]¶: Track the size of the tree and disable/enable nodes if required.

estimate_model_byte_size(self)[source]¶: Calculate the size of the model and trigger tracker function if the actual model size exceeds the max size in the configuration.

fit(self, X, y, classes=None, sample_weight=None)[source]¶

Fit the model.

Parameters

Xnumpy.ndarray of shape (n_samples, n_features): The features to train the model.
y: numpy.ndarray of shape (n_samples, n_targets): An array-like with the class labels of all samples in X.
classes: numpy.ndarray, optional (default=None): Contains all possible/known class labels. Usage varies depending on the learning method.
sample_weight: numpy.ndarray, optional (default=None): Samples weight. If not provided, uniform weights are assumed. Usage varies depending on the learning method.

Returns

self

get_info(self)[source]¶

Collects and returns the information about the configuration of the estimator

Returns

string: Configuration of the estimator.

get_model_description(self)[source]¶

Walk the tree and return its structure in a buffer.

Returns

string: The description of the model.

property get_model_measurements¶

Collect metrics corresponding to the current status of the tree.

Returns

string: A string buffer containing the measurements of the tree.

get_model_rules(self)[source]¶

Returns list of list describing the tree.

Returns

list (Rule): list of the rules describing the tree

get_params(self, deep=True)[source]¶

Get parameters for this estimator.

Parameters

deepboolean, optional: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

paramsmapping of string to any: Parameter names mapped to their values.

get_rules_description(self)[source]¶: Prints the the description of tree using rules.

get_votes_for_instance(self, X)[source]¶

Get class votes for a single instance.

Parameters

X: numpy.ndarray of length equal to the number of features.: Instance attributes.

Returns

dict (class_value, weight)

measure_byte_size(self)[source]¶

Calculate the size of the tree.

Returns

int: Size of the tree in bytes.

measure_tree_depth(self)[source]¶

Calculate the depth of the tree.

Returns

int: Depth of the tree.

new_split_node(self, split_test, class_observations)[source]¶: Create a new split node.

partial_fit(self, X, y, classes=None, sample_weight=None)[source]¶

Incrementally trains the model. Train samples (instances) are composed of X attributes and their corresponding targets y.

Parameters

X: numpy.ndarray of shape (n_samples, n_features): Instance attributes.
y: array_like: Classes (targets) for all samples in X.
classes: numpy.array: Contains the class values in the stream. If defined, will be used to define the length of the arrays returned by predict_proba
sample_weight: float or array-like: Samples weight. If not provided, uniform weights are assumed.

Returns

self

Notes

Tasks performed before training:

Verify instance weight. if not provided, uniform weights (1.0) are assumed.
If more than one instance is passed, loop through X and pass instances one at a time.
Update weight seen by model.

Training tasks:

If the tree is empty, create a leaf node as the root.
If the tree is already initialized, find the corresponding leaf for the instance and update the leaf node statistics.
If growth is allowed and the number of instances that the leaf has observed between split attempts exceed the grace period then attempt to split.

predict(self, X)[source]¶

Predicts the label of the X instance(s)

Parameters

X: numpy.ndarray of shape (n_samples, n_features): Samples for which we want to predict the labels.

Returns

numpy.array: Predicted labels for all instances in X.

predict_proba(self, X)[source]¶

Predicts probabilities of all label of the X instance(s)

Parameters

X: numpy.ndarray of shape (n_samples, n_features): Samples for which we want to predict the labels.

Returns

numpy.array: Predicted the probabilities of all the labels for all instances in X.

reset(self)[source]¶: Reset the Hoeffding Tree to default values.

score(self, X, y, sample_weight=None)[source]¶

Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters

Xarray-like, shape = (n_samples, n_features): Test samples.
yarray-like, shape = (n_samples) or (n_samples, n_outputs): True labels for X.
sample_weightarray-like, shape = [n_samples], optional: Sample weights.

Returns

scorefloat: Mean accuracy of self.predict(X) wrt. y.

set_params(self, **params)[source]¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns

self