skmultiflow.trees.HoeffdingTreeClassifier

class skmultiflow.trees.HoeffdingTreeClassifier(max_byte_size=33554432, memory_estimate_period=1000000, grace_period=200, split_criterion='info_gain', split_confidence=1e-07, tie_threshold=0.05, binary_split=False, stop_mem_management=False, remove_poor_atts=False, no_preprune=False, leaf_prediction='nba', nb_threshold=0, nominal_attributes=None)[source]
Hoeffding Tree or Very Fast Decision Tree classifier.
 Parameters
 max_byte_size: int (default=33554432)
Maximum memory consumed by the tree.
 memory_estimate_period: int (default=1000000)
Number of instances between memory consumption checks.
 grace_period: int (default=200)
Number of instances a leaf should observe between split attempts.
 split_criterion: string (default='info_gain')
Split criterion to use:
'gini' - Gini
'info_gain' - Information Gain
'hellinger' - Hellinger Distance
 split_confidence: float (default=0.0000001)
Allowed error in split decision, a value closer to 0 takes longer to decide.
 tie_threshold: float (default=0.05)
Threshold below which a split will be forced to break ties.
 binary_split: boolean (default=False)
If True, only allow binary splits.
 stop_mem_management: boolean (default=False)
If True, stop growing as soon as memory limit is hit.
 remove_poor_atts: boolean (default=False)
If True, disable poor attributes.
 no_preprune: boolean (default=False)
If True, disable prepruning.
 leaf_prediction: string (default='nba')
Prediction mechanism used at leaves:
'mc' - Majority Class
'nb' - Naive Bayes
'nba' - Naive Bayes Adaptive
 nb_threshold: int (default=0)
Number of instances a leaf should observe before allowing Naive Bayes.
 nominal_attributes: list, optional
List of Nominal attributes. If empty, then assume that all attributes are numerical.
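For illustration, a minimal configuration sketch using the parameters above; the values chosen are arbitrary examples, not recommended settings:
>>> from skmultiflow.trees import HoeffdingTreeClassifier
>>>
>>> # Illustrative configuration: Gini splits, a shorter grace period, plain Naive
>>> # Bayes leaves, and the first two attributes treated as nominal (by index)
>>> ht = HoeffdingTreeClassifier(split_criterion='gini',
...                              grace_period=100,
...                              leaf_prediction='nb',
...                              nominal_attributes=[0, 1])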
Notes
A Hoeffding Tree [1] is an incremental, anytime decision tree induction algorithm that is capable of learning from massive data streams, assuming that the distribution generating examples does not change over time. Hoeffding trees exploit the fact that a small sample can often be enough to choose an optimal splitting attribute. This idea is supported mathematically by the Hoeffding bound, which quantifies the number of observations (in our case, examples) needed to estimate some statistics within a prescribed precision (in our case, the goodness of an attribute).
A theoretically appealing feature of Hoeffding Trees, not shared by other incremental decision tree learners, is that they have sound guarantees of performance. Using the Hoeffding bound one can show that their output is asymptotically nearly identical to that of a non-incremental learner using infinitely many examples.
Implementation based on MOA [2].
References
[1] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In KDD'01, pages 97-106, San Francisco, CA, 2001. ACM Press.
[2] Albert Bifet, Geoff Holmes, Richard Kirkby, Bernhard Pfahringer. MOA: Massive Online Analysis. Journal of Machine Learning Research 11: 1601-1604, 2010.
Examples
>>> # Imports
>>> from skmultiflow.data import SEAGenerator
>>> from skmultiflow.trees import HoeffdingTreeClassifier
>>>
>>> # Setting up a data stream
>>> stream = SEAGenerator(random_state=1)
>>>
>>> # Setup Hoeffding Tree estimator
>>> ht = HoeffdingTreeClassifier()
>>>
>>> # Setup variables to control loop and track performance
>>> n_samples = 0
>>> correct_cnt = 0
>>> max_samples = 200
>>>
>>> # Train the estimator with the samples provided by the data stream
>>> while n_samples < max_samples and stream.has_more_samples():
>>>     X, y = stream.next_sample()
>>>     y_pred = ht.predict(X)
>>>     if y[0] == y_pred[0]:
>>>         correct_cnt += 1
>>>     ht = ht.partial_fit(X, y)
>>>     n_samples += 1
>>>
>>> # Display results
>>> print('{} samples analyzed.'.format(n_samples))
>>> print('Hoeffding Tree accuracy: {}'.format(correct_cnt / n_samples))
Methods
compute_hoeffding_bound(range_val, confidence, n): Compute the Hoeffding bound, used to decide how many samples are necessary at each node.
deactivate_all_leaves(self): Deactivate all leaves.
enforce_tracker_limit(self): Track the size of the tree and disable/enable nodes if required.
estimate_model_byte_size(self): Calculate the size of the model and trigger tracker function if the actual model size exceeds the max size in the configuration.
fit(self, X, y[, classes, sample_weight]): Fit the model.
get_info(self): Collects and returns information about the configuration of the estimator.
get_model_description(self): Walk the tree and return its structure in a buffer.
get_model_rules(self): Returns a list of rules describing the tree.
get_params(self[, deep]): Get parameters for this estimator.
get_rules_description(self): Prints the description of the tree using rules.
get_votes_for_instance(self, X): Get class votes for a single instance.
measure_byte_size(self): Calculate the size of the tree.
measure_tree_depth(self): Calculate the depth of the tree.
new_split_node(self, split_test, …): Create a new split node.
partial_fit(self, X, y[, classes, sample_weight]): Incrementally trains the model.
predict(self, X): Predicts the label of the X instance(s).
predict_proba(self, X): Predicts the probabilities of all labels for the X instance(s).
reset(self): Reset the Hoeffding Tree to default values.
score(self, X, y[, sample_weight]): Returns the mean accuracy on the given test data and labels.
set_params(self, **params): Set the parameters of this estimator.
Attributes
binary_split
classes
get_model_measurements: Collect metrics corresponding to the current status of the tree.
grace_period
leaf_prediction
max_byte_size
memory_estimate_period
nb_threshold
no_preprune
nominal_attributes
remove_poor_atts
split_confidence
split_criterion
stop_mem_management
tie_threshold

static compute_hoeffding_bound(range_val, confidence, n)[source]
Compute the Hoeffding bound, used to decide how many samples are necessary at each node.
 Parameters
 range_val: float
Range value.
 confidence: float
Confidence of choosing the correct attribute.
 n: int or float
Number of samples.
 Returns
 float
The Hoeffding bound.
Notes
The Hoeffding bound is defined as:
\[\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}\]
where:
\(\epsilon\): Hoeffding bound.
\(R\): Range of a random variable. For a probability the range is 1, and for an information gain the range is log c, where c is the number of classes.
\(\delta\): Confidence. 1 minus the desired probability of choosing the correct attribute at any given node.
\(n\): Number of samples.
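As a quick sanity check of the formula, a minimal sketch that evaluates the bound both directly and through the static helper; the values of R, delta and n are illustrative only, and the confidence argument is assumed to correspond to delta (the split_confidence parameter):
>>> import numpy as np
>>> from skmultiflow.trees import HoeffdingTreeClassifier
>>>
>>> # Illustrative values: information-gain split on a 2-class problem, so the
>>> # range R is log2(2) = 1; delta plays the role of split_confidence; n samples seen
>>> R, delta, n = 1.0, 1e-7, 200
>>>
>>> # Direct evaluation of epsilon = sqrt(R^2 * ln(1/delta) / (2n))
>>> eps_manual = np.sqrt(R ** 2 * np.log(1.0 / delta) / (2 * n))
>>>
>>> # Same value via the static helper
>>> eps_method = HoeffdingTreeClassifier.compute_hoeffding_bound(R, delta, n)
>>>
>>> # Both evaluate to roughly 0.2: a split is made once the merit difference between
>>> # the two best attributes exceeds this bound, or the bound drops below tie_threshold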

enforce_tracker_limit(self)[source]
Track the size of the tree and disable/enable nodes if required.

estimate_model_byte_size(self)[source]
Calculate the size of the model and trigger tracker function if the actual model size exceeds the max size in the configuration.

fit(self, X, y, classes=None, sample_weight=None)[source]
Fit the model.
 Parameters
 X: numpy.ndarray of shape (n_samples, n_features)
The features to train the model.
 y: numpy.ndarray of shape (n_samples, n_targets)
An array-like with the class labels of all samples in X.
 classes: numpy.ndarray, optional (default=None)
Contains all possible/known class labels. Usage varies depending on the learning method.
 sample_weight: numpy.ndarray, optional (default=None)
Sample weights. If not provided, uniform weights are assumed. Usage varies depending on the learning method.
 Returns
 self
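A minimal batch-style sketch of fit followed by predict; the toy arrays below are made up purely for illustration:
>>> import numpy as np
>>> from skmultiflow.trees import HoeffdingTreeClassifier
>>>
>>> # Toy batch: 4 samples, 2 features, binary labels (illustrative data only)
>>> X = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
>>> y = np.array([0, 1, 0, 1])
>>>
>>> ht = HoeffdingTreeClassifier()
>>> ht = ht.fit(X, y, classes=[0, 1])     # known classes can be declared up front
>>> y_pred = ht.predict(np.array([[0.15, 0.25]]))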

get_info(self)[source]
Collects and returns information about the configuration of the estimator.
 Returns
 string
Configuration of the estimator.

get_model_description(self)[source]
Walk the tree and return its structure in a buffer.
 Returns
 string
The description of the model.

property get_model_measurements
Collect metrics corresponding to the current status of the tree.
 Returns
 string
A string buffer containing the measurements of the tree.

get_model_rules(self)[source]
Returns a list of rules describing the tree.
 Returns
 list (Rule)
List of the rules describing the tree.

get_params(self, deep=True)[source]
Get parameters for this estimator.
 Parameters
 deep: boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 params: mapping of string to any
Parameter names mapped to their values.

get_votes_for_instance(self, X)[source]
Get class votes for a single instance.
 Parameters
 X: numpy.ndarray of length equal to the number of features.
Instance attributes.
 Returns
 dict (class_value, weight)
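A short usage sketch, assuming ht is an already trained HoeffdingTreeClassifier and x is a single instance (a 1-D array of feature values); the vote values in the comment are hypothetical:
>>> # x is one instance, e.g. the first row of a (1, n_features) sample block
>>> votes = ht.get_votes_for_instance(x)
>>> # e.g. {0: 12.0, 1: 3.0} -- a weight per class value (illustrative numbers)
>>> prediction = max(votes, key=votes.get) if votes else None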

measure_byte_size(self)[source]
Calculate the size of the tree.
 Returns
 int
Size of the tree in bytes.
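For example, a trained tree can be inspected for memory footprint and structure; this sketch assumes ht is a HoeffdingTreeClassifier that has already seen data:
>>> size_in_bytes = ht.measure_byte_size()
>>> depth = ht.measure_tree_depth()
>>> print('Tree size: {} bytes, depth: {}'.format(size_in_bytes, depth))
>>> print(ht.get_model_description())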

partial_fit(self, X, y, classes=None, sample_weight=None)[source]
Incrementally trains the model. Train samples (instances) are composed of X attributes and their corresponding targets y.
 Parameters
 X: numpy.ndarray of shape (n_samples, n_features)
Instance attributes.
 y: array-like
Classes (targets) for all samples in X.
 classes: numpy.array
Contains the class values in the stream. If defined, will be used to define the length of the arrays returned by predict_proba.
 sample_weight: float or array-like
Sample weights. If not provided, uniform weights are assumed.
 Returns
 self
Notes
Tasks performed before training:
Verify instance weight. If not provided, uniform weights (1.0) are assumed.
If more than one instance is passed, loop through X and pass instances one at a time.
Update weight seen by model.
Training tasks:
If the tree is empty, create a leaf node as the root.
If the tree is already initialized, find the corresponding leaf for the instance and update the leaf node statistics.
If growth is allowed and the number of instances that the leaf has observed between split attempts exceed the grace period then attempt to split.
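A minimal incremental-training sketch, assuming a SEAGenerator stream as in the main example; declaring classes up front keeps the arrays returned by predict_proba at a consistent length:
>>> from skmultiflow.data import SEAGenerator
>>> from skmultiflow.trees import HoeffdingTreeClassifier
>>>
>>> stream = SEAGenerator(random_state=1)
>>> ht = HoeffdingTreeClassifier()
>>>
>>> # Process 50 samples one at a time, declaring the known classes up front
>>> for _ in range(50):
...     X, y = stream.next_sample()
...     ht = ht.partial_fit(X, y, classes=[0, 1])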

predict(self, X)[source]
Predicts the label of the X instance(s).
 Parameters
 X: numpy.ndarray of shape (n_samples, n_features)
Samples for which we want to predict the labels.
 Returns
 numpy.array
Predicted labels for all instances in X.

predict_proba(self, X)[source]
Predicts the probabilities of all labels for the X instance(s).
 Parameters
 X: numpy.ndarray of shape (n_samples, n_features)
Samples for which we want to predict the labels.
 Returns
 numpy.array
Predicted probabilities of all labels for all instances in X.
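A short sketch of both prediction methods, assuming ht was trained on a SEA stream as in the partial_fit sketch above (SEA streams have three numeric features):
>>> import numpy as np
>>>
>>> X_new = np.array([[0.5, 0.4, 0.7]])   # one unseen instance (illustrative values)
>>> y_pred = ht.predict(X_new)            # hard labels, shape (1,)
>>> y_proba = ht.predict_proba(X_new)     # per-class probabilities, one row per instance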

score(self, X, y, sample_weight=None)[source]
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that each label set be correctly predicted for each sample.
 Parameters
 X: array-like, shape = (n_samples, n_features)
Test samples.
 y: array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
 sample_weight: array-like, shape = [n_samples], optional
Sample weights.
 Returns
 score: float
Mean accuracy of self.predict(X) w.r.t. y.

set_params(self, **params)[source]
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
 Returns
 self
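For example, hyperparameters can be read and updated through this standard estimator interface; the values below are illustrative:
>>> ht = HoeffdingTreeClassifier()
>>> params = ht.get_params()       # dict of parameter names mapped to values
>>> ht = ht.set_params(grace_period=100, tie_threshold=0.1)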