skmultiflow.trees.ExtremelyFastDecisionTreeClassifier
Extremely Fast Decision Tree classifier.
Parameters
max_byte_size: int
    Maximum memory consumed by the tree.
memory_estimate_period: int
    Number of instances between memory consumption checks.
grace_period: int
    Number of instances a leaf should observe between split attempts.
min_samples_reevaluate: int
    Number of instances a node should observe before reevaluating the best split.
split_confidence: float
    Allowed error in the split decision; a value closer to 0 takes longer to decide.
tie_threshold: float
    Threshold below which a split will be forced to break ties.
binary_split: bool
    If True, only allow binary splits.
stop_mem_management: bool
    If True, stop growing as soon as the memory limit is hit.
nb_threshold: int
    Number of instances a leaf should observe before allowing Naive Bayes.
nominal_attributes: list, optional
    List of nominal attributes. If empty, all attributes are assumed to be numerical.
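For instance, a configuration that splits more cautiously might be sketched as follows; the parameter names match the list above, and the values shown are illustrative choices rather than recommended defaults:

>>> from skmultiflow.trees import ExtremelyFastDecisionTreeClassifier
>>> efdt = ExtremelyFastDecisionTreeClassifier(
...     grace_period=200,           # instances a leaf observes between split attempts
...     min_samples_reevaluate=20,  # instances between reevaluations of a node's best split
...     split_confidence=1e-07,     # smaller values demand more evidence before splitting
...     tie_threshold=0.05,         # force a split when the top candidates are this close
...     binary_split=False,         # allow multiway splits
...     nominal_attributes=[0, 3])  # treat attributes 0 and 3 as nominal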
Notes
The Extremely Fast Decision Tree (EFDT) [1] constructs a tree incrementally. The EFDT seeks to select and deploy a split as soon as it is confident the split is useful, and then revisits that decision, replacing the split if it subsequently becomes evident that a better split is available. The EFDT learns rapidly from a stationary distribution and eventually it learns the asymptotic batch tree if the distribution from which the data are drawn is stationary.
References
[1] C. Manapragada, G. Webb, and M. Salehi. Extremely Fast Decision Tree. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 1953-1962. DOI: https://doi.org/10.1145/3219819.3220005
Examples
>>> # Imports
>>> from skmultiflow.data import SEAGenerator
>>> from skmultiflow.trees import ExtremelyFastDecisionTreeClassifier
>>>
>>> # Setting up a data stream
>>> stream = SEAGenerator(random_state=1)
>>>
>>> # Setup Extremely Fast Decision Tree classifier
>>> efdt = ExtremelyFastDecisionTreeClassifier()
>>>
>>> # Setup variables to control loop and track performance
>>> n_samples = 0
>>> correct_cnt = 0
>>> max_samples = 200
>>>
>>> # Train the estimator with the samples provided by the data stream
>>> while n_samples < max_samples and stream.has_more_samples():
>>>     X, y = stream.next_sample()
>>>     y_pred = efdt.predict(X)
>>>     if y[0] == y_pred[0]:
>>>         correct_cnt += 1
>>>     efdt.partial_fit(X, y)
>>>     n_samples += 1
>>>
>>> # Display results
>>> print('{} samples analyzed.'.format(n_samples))
>>> print('Extremely Fast Decision Tree accuracy: {}'.format(correct_cnt / n_samples))
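The same test-then-train loop can also be delegated to skmultiflow's prequential evaluator; a minimal sketch, assuming EvaluatePrequential from skmultiflow.evaluation with an accuracy metric:

>>> from skmultiflow.data import SEAGenerator
>>> from skmultiflow.evaluation import EvaluatePrequential
>>> from skmultiflow.trees import ExtremelyFastDecisionTreeClassifier
>>>
>>> stream = SEAGenerator(random_state=1)
>>> efdt = ExtremelyFastDecisionTreeClassifier()
>>> # test-then-train over 200 samples, tracking accuracy
>>> evaluator = EvaluatePrequential(max_samples=200, metrics=['accuracy'], show_plot=False)
>>> evaluator.evaluate(stream=stream, model=efdt)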
Methods
fit(X, y[, classes, sample_weight])
    Fit the model.
get_info()
    Collects and returns information about the configuration of the estimator.
get_model_description()
    Walk the tree and return its structure in a buffer.
get_model_rules()
    Returns the list of rules describing the tree.
get_params([deep])
    Get the parameters for this estimator.
get_rules_description()
    Prints the description of the tree using rules.
measure_byte_size()
    Calculate the size of the tree.
partial_fit(X, y[, classes, sample_weight])
    Incrementally trains the model.
predict(X)
    Predicts the label of the X instance(s).
predict_proba(X)
    Predicts the probabilities of all labels for the X instance(s).
reset()
    Reset the Hoeffding Tree to default values.
score(X, y[, sample_weight])
    Returns the mean accuracy on the given test data and labels.
set_params(**params)
    Set the parameters of this estimator.
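The inspection methods above can be combined once the tree has seen some data; a short sketch, assuming efdt was trained as in the Examples section:

>>> print(efdt.get_info())               # estimator configuration
>>> print(efdt.get_model_description())  # tree structure as text
>>> print(efdt.measure_byte_size())      # size of the tree in bytes
>>> efdt.reset()                         # discard the learned tree and start over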
Attributes
leaf_prediction
    The leaf prediction mechanism in use.
model_measurements
    Collect metrics corresponding to the current status of the tree.
split_criterion
    The split criterion in use.
fit(X, y[, classes, sample_weight])
    X: The features to train the model.
    y: An array-like with the class labels of all samples in X.
    classes: Contains all possible/known class labels. Usage varies depending on the learning method.
    sample_weight: Samples weight. If not provided, uniform weights are assumed. Usage varies depending on the learning method.
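In batch style, a chunk of the stream can be collected first and fitted in one call; a minimal sketch, assuming the SEAGenerator stream from the Examples section:

>>> X, y = stream.next_sample(100)         # grab a batch of 100 instances
>>> efdt = efdt.fit(X, y, classes=[0, 1])  # SEA streams are binary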
get_info()
    Returns: Configuration of the estimator.
get_model_description()
    Returns: The description of the model.
get_model_rules()
    Returns: The list of rules describing the tree.
get_params([deep])
    deep: If True, will return the parameters for this estimator and contained subobjects that are estimators.
    Returns: Parameter names mapped to their values.
measure_byte_size()
    Returns: Size of the tree in bytes.
model_measurements
    Returns: A string buffer containing the measurements of the tree.
partial_fit(X, y[, classes, sample_weight])
    Incrementally trains the model. Train samples (instances) are composed of X attributes and their corresponding targets y.
    X: Instance attributes.
    y: Classes (targets) for all samples in X.
    classes: Contains the class values in the stream. If defined, will be used to define the length of the arrays returned by predict_proba.
    sample_weight: Samples weight. If not provided, uniform weights are assumed.
    Tasks performed before training:
    - Verify the instance weight. If not provided, uniform weights (1.0) are assumed.
    - If more than one instance is passed, loop through X and pass instances one at a time.
    - Update the weight seen by the model.
    Training tasks:
    - If the tree is empty, create a leaf node as the root.
    - If the tree is already initialized, find the path from the root to the corresponding leaf for the instance and sort the instance down it.
    - Reevaluate the best split for each internal node.
    - Attempt to split the leaf.
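Because new labels can appear later in a stream, the classes argument lets the first call declare every label up front; a minimal sketch:

>>> import numpy as np
>>> from skmultiflow.trees import ExtremelyFastDecisionTreeClassifier
>>> efdt = ExtremelyFastDecisionTreeClassifier()
>>> X, y = np.array([[0.2, 1.5, 0.7]]), np.array([0])
>>> # declaring classes fixes the width of predict_proba's output
>>> efdt = efdt.partial_fit(X, y, classes=[0, 1])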
predict(X)
    X: Samples for which we want to predict the labels.
    Returns: Predicted labels for all instances in X.
predict_proba(X)
    Returns: Predicted probabilities of all labels for all instances in X.
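Continuing the sketch above, prediction follows the usual scikit-learn conventions; the exact values depend on what the model has seen:

>>> y_pred = efdt.predict(X)         # one hard label per instance, e.g. array([0])
>>> y_proba = efdt.predict_proba(X)  # one probability per known class per instance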
score(X, y[, sample_weight])
    Returns the mean accuracy on the given test data and labels. In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that each label set be correctly predicted for every sample.
    X: Test samples.
    y: True labels for X.
    sample_weight: Sample weights.
    Returns: Mean accuracy of self.predict(X) with respect to y.
set_params(**params)
    The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
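For this estimator the flat (non-nested) form is enough; a small sketch:

>>> efdt = ExtremelyFastDecisionTreeClassifier()
>>> efdt = efdt.set_params(grace_period=300, tie_threshold=0.1)
>>> efdt.get_params()['grace_period']
300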