skmultiflow.trees.ExtremelyFastDecisionTreeClassifier¶

class skmultiflow.trees.ExtremelyFastDecisionTreeClassifier(max_byte_size=33554432, memory_estimate_period=1000000, grace_period=200, min_samples_reevaluate=20, split_criterion='info_gain', split_confidence=1e-07, tie_threshold=0.05, binary_split=False, stop_mem_management=False, leaf_prediction='nba', nb_threshold=0, nominal_attributes=None)[source]¶

Extremely Fast Decision Tree classifier.

Parameters

max_byte_size: int (default=33554432): Maximum memory consumed by the tree.
memory_estimate_period: int (default=1000000): Number of instances between memory consumption checks.
grace_period: int (default=200): Number of instances a leaf should observe between split attempts.
min_samples_reevaluate: int (default=20): Number of instances a node should observe before reevaluate the best split.
split_criterion: string (default=’info_gain’): Split criterion to use.

‘gini’ - Gini

‘info_gain’ - Information Gain
split_confidence: float (default=0.0000001): Allowed error in split decision, a value closer to 0 takes longer to decide.
tie_threshold: float (default=0.05): Threshold below which a split will be forced to break ties.
binary_split: boolean (default=False): If True, only allow binary splits.
stop_mem_management: boolean (default=False): If True, stop growing as soon as memory limit is hit.
leaf_prediction: string (default=’nba’): Prediction mechanism used at leafs.

‘mc’ - Majority Class

‘nb’ - Naive Bayes

‘nba’ - Naive Bayes Adaptive
nb_threshold: int (default=0): Number of instances a leaf should observe before allowing Naive Bayes.
nominal_attributes: list, optional: List of Nominal attributes. If emtpy, then assume that all attributes are numerical.

Notes

The Extremely Fast Decision Tree (EFDT) [1] constructs a tree incrementally. The EFDT seeks to select and deploy a split as soon as it is confident the split is useful, and then revisits that decision, replacing the split if it subsequently becomes evident that a better split is available. The EFDT learns rapidly from a stationary distribution and eventually it learns the asymptotic batch tree if the distribution from which the data are drawn is stationary.

References

1: C. Manapragada, G. Webb, and M. Salehi. Extremely Fast Decision Tree. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ‘18). ACM, New York, NY, USA, 1953-1962. DOI: https://doi.org/10.1145/3219819.3220005

Examples

>>> # Imports
>>> from skmultiflow.data import SEAGenerator
>>> from skmultiflow.trees import ExtremelyFastDecisionTreeClassifier
>>>
>>> # Setting up a data stream
>>> stream = SEAGenerator(random_state=1)
>>>
>>> # Setup Extremely Fast Decision Tree classifier
>>> efdt = ExtremelyFastDecisionTreeClassifier()
>>>
>>> # Setup variables to control loop and track performance
>>> n_samples = 0
>>> correct_cnt = 0
>>> max_samples = 200
>>>
>>> # Train the estimator with the samples provided by the data stream
>>> while n_samples < max_samples and stream.has_more_samples():
>>>     X, y = stream.next_sample()
>>>     y_pred = efdt.predict(X)
>>>     if y[0] == y_pred[0]:
>>>         correct_cnt += 1
>>>     efdt.partial_fit(X, y)
>>>     n_samples += 1
>>>
>>> # Display results
>>> print('{} samples analyzed.'.format(n_samples))
>>> print('Extremely Fast Decision Tree accuracy: {}'.format(correct_cnt / n_samples))

Methods

`compute_hoeffding_bound`(range_val, confidence, n)	Compute the Hoeffding bound, used to decide how many samples are necessary at each node.
`deactivate_all_leaves`(self)	Deactivate all leaves.
`enforce_tracker_limit`(self)	Track the size of the tree and disable/enable nodes if required.
`estimate_model_byte_size`(self)	Calculate the size of the model and trigger tracker function if the actual model size exceeds the max size in the configuration.
`fit`(self, X, y[, classes, sample_weight])	Fit the model.
`get_info`(self)	Collects and returns the information about the configuration of the estimator
`get_model_description`(self)	Walk the tree and return its structure in a buffer.
`get_model_rules`(self)	Returns list of list describing the tree.
`get_params`(self[, deep])	Get parameters for this estimator.
`get_rules_description`(self)	Prints the the description of tree using rules.
`get_votes_for_instance`(self, X)	Get class votes for a single instance.
`measure_byte_size`(self)	Calculate the size of the tree.
`measure_tree_depth`(self)	Calculate the depth of the tree.
`new_split_node`(self, split_test, …)	Create a new split node.
`partial_fit`(self, X, y[, classes, sample_weight])	Incrementally trains the model.
`predict`(self, X)	Predicts the label of the X instance(s)
`predict_proba`(self, X)	Predicts probabilities of all label of the X instance(s)
`reset`(self)	Reset the Hoeffding Anytime Tree to default values.
`score`(self, X, y[, sample_weight])	Returns the mean accuracy on the given test data and labels.
`set_params`(self, **params)	Set the parameters of this estimator.

Attributes

`binary_split`
`classes`
`get_model_measurements`	Collect metrics corresponding to the current status of the tree.
`grace_period`
`leaf_prediction`
`max_byte_size`
`memory_estimate_period`
`nb_threshold`
`no_preprune`
`nominal_attributes`
`remove_poor_atts`
`split_confidence`
`split_criterion`
`stop_mem_management`
`tie_threshold`

static compute_hoeffding_bound(range_val, confidence, n)[source]¶

Compute the Hoeffding bound, used to decide how many samples are necessary at each node.

Parameters

range_val: float: Range value.
confidence: float: Confidence of choosing the correct attribute.
n: int or float: Number of samples.

Returns

float: The Hoeffding bound.

Notes

The Hoeffding bound is defined as:

\[\epsilon = \sqrt{\frac{R^2\ln(1/\delta))}{2n}}\]

where:

\(\epsilon\): Hoeffding bound.

\(R\): Range of a random variable. For a probability the range is 1, and for an information gain the range is log c, where c is the number of classes.

\(\delta\): Confidence. 1 minus the desired probability of choosing the correct attribute at any given node.

\(n\): Number of samples.

deactivate_all_leaves(self)[source]¶: Deactivate all leaves.

enforce_tracker_limit(self)[source]¶: Track the size of the tree and disable/enable nodes if required.

estimate_model_byte_size(self)[source]¶: Calculate the size of the model and trigger tracker function if the actual model size exceeds the max size in the configuration.

fit(self, X, y, classes=None, sample_weight=None)[source]¶

Fit the model.

Parameters

Xnumpy.ndarray of shape (n_samples, n_features): The features to train the model.
y: numpy.ndarray of shape (n_samples, n_targets): An array-like with the class labels of all samples in X.
classes: numpy.ndarray, optional (default=None): Contains all possible/known class labels. Usage varies depending on the learning method.
sample_weight: numpy.ndarray, optional (default=None): Samples weight. If not provided, uniform weights are assumed. Usage varies depending on the learning method.

Returns

self

get_info(self)[source]¶

Collects and returns the information about the configuration of the estimator

Returns

string: Configuration of the estimator.

get_model_description(self)[source]¶

Walk the tree and return its structure in a buffer.

Returns

string: The description of the model.

property get_model_measurements¶

Collect metrics corresponding to the current status of the tree.

Returns

string: A string buffer containing the measurements of the tree.

get_model_rules(self)[source]¶

Returns list of list describing the tree.

Returns

list (Rule): list of the rules describing the tree

get_params(self, deep=True)[source]¶

Get parameters for this estimator.

Parameters

deepboolean, optional: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

paramsmapping of string to any: Parameter names mapped to their values.

get_rules_description(self)[source]¶: Prints the the description of tree using rules.

get_votes_for_instance(self, X)[source]¶

Get class votes for a single instance.

Parameters

X: numpy.ndarray of length equal to the number of features.: Instance attributes.

Returns

dict (class_value, weight)

measure_byte_size(self)[source]¶

Calculate the size of the tree.

Returns

int: Size of the tree in bytes.

measure_tree_depth(self)[source]¶

Calculate the depth of the tree.

Returns

int: Depth of the tree.

new_split_node(self, split_test, class_observations, attribute_observers)[source]¶: Create a new split node.

partial_fit(self, X, y, classes=None, sample_weight=None)[source]¶

Incrementally trains the model. Train samples (instances) are composed of X attributes and their corresponding targets y.

Parameters

X: numpy.ndarray of shape (n_samples, n_features): Instance attributes.
y: array_like: Classes (targets) for all samples in X.
classes: list or numpy.array: Contains the class values in the stream. If defined, will be used to define the length of the arrays returned by predict_proba
sample_weight: float or array-like: Samples weight. If not provided, uniform weights are assumed.

Notes

Tasks performed before training:

Verify instance weight. if not provided, uniform weights (1.0) are assumed.
If more than one instance is passed, loop through X and pass instances one at a time.
Update weight seen by model.

Training tasks:

If the tree is empty, create a leaf node as the root.
If the tree is already initialized, find the path from root to the corresponding leaf for the instance and sort the instance.
- Reevaluate the best split for each internal node.
- Attempt to split the leaf.

predict(self, X)[source]¶

Predicts the label of the X instance(s)

Parameters

X: numpy.ndarray of shape (n_samples, n_features): Samples for which we want to predict the labels.

Returns

numpy.array: Predicted labels for all instances in X.

predict_proba(self, X)[source]¶

Predicts probabilities of all label of the X instance(s)

Parameters

X: numpy.ndarray of shape (n_samples, n_features): Samples for which we want to predict the labels.

Returns

numpy.array: Predicted the probabilities of all the labels for all instances in X.

reset(self)[source]¶: Reset the Hoeffding Anytime Tree to default values.

score(self, X, y, sample_weight=None)[source]¶

Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters

Xarray-like, shape = (n_samples, n_features): Test samples.
yarray-like, shape = (n_samples) or (n_samples, n_outputs): True labels for X.
sample_weightarray-like, shape = [n_samples], optional: Sample weights.

Returns

scorefloat: Mean accuracy of self.predict(X) wrt. y.

set_params(self, **params)[source]¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns

self