skmultiflow.lazy.KNNClassifier

k-Nearest Neighbors classifier.
This non-parametric classification method keeps track of the last max_window_size training samples. The predicted class label for a given query sample is obtained in two steps:

1. Find the n_neighbors samples closest to the query sample in the data window.
2. Aggregate the class labels of those n_neighbors to define the predicted class for the query sample.
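The two steps above can be sketched in plain Python (a simplified illustration, not the library's implementation — the library uses a KDTree over a bounded window, while this sketch uses a brute-force search over a deque):

```python
from collections import deque, Counter
import math

def knn_predict(window, query, n_neighbors=3):
    """window: deque of (features, label) pairs; query: feature list."""
    # Step 1: find the n_neighbors samples closest to the query.
    by_distance = sorted(window, key=lambda pair: math.dist(pair[0], query))
    nearest = by_distance[:n_neighbors]
    # Step 2: aggregate their class labels by majority vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

window = deque(maxlen=5)  # plays the role of max_window_size
for x, y in [([0.0], 0), ([0.1], 0), ([1.0], 1), ([1.1], 1), ([0.2], 0)]:
    window.append((x, y))

print(knn_predict(window, [0.05], n_neighbors=3))  # -> 0
```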
Parameters

n_neighbors
    The number of nearest neighbors to search for.
max_window_size
    The maximum size of the window storing the last observed samples.
leaf_size
    sklearn.KDTree parameter. The maximum number of samples that can be stored in one leaf node, which determines the point at which the algorithm switches to a brute-force approach. The larger this number, the faster the tree construction, but the slower the queries.
metric
    sklearn.KDTree parameter. The distance metric to use for the KDTree. Default='euclidean'. KNNClassifier.valid_metrics() gives a list of the metrics which are valid for the KDTree.
Notes
This estimator is not optimal for a mixture of categorical and numerical features. This implementation treats all features from a given stream as numerical.
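Since all features are treated as numerical, categorical features should be encoded before training. A minimal one-hot sketch (the category list and helper below are assumptions for illustration, not part of the library):

```python
def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicator features."""
    return [1.0 if value == c else 0.0 for c in categories]

# Hypothetical mixed sample: one categorical feature, one numeric feature.
colors = ["red", "green", "blue"]
sample = ["green", 4.2]
encoded = one_hot(sample[0], colors) + [sample[1]]
print(encoded)  # -> [0.0, 1.0, 0.0, 4.2]
```

The encoded vector is all-numerical and can then be fed to the classifier.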
Examples
>>> # Imports
>>> from skmultiflow.lazy import KNNClassifier
>>> from skmultiflow.data import SEAGenerator
>>> # Setting up the stream
>>> stream = SEAGenerator(random_state=1, noise_percentage=.1)
>>> knn = KNNClassifier(n_neighbors=8, max_window_size=2000, leaf_size=40)
>>> # Keep track of sample count and correct prediction count
>>> n_samples = 0
>>> corrects = 0
>>> while n_samples < 5000:
...     X, y = stream.next_sample()
...     my_pred = knn.predict(X)
...     if y[0] == my_pred[0]:
...         corrects += 1
...     knn = knn.partial_fit(X, y)
...     n_samples += 1
>>>
>>> # Displaying results
>>> print('KNNClassifier usage example')
KNNClassifier usage example
>>> print('{} samples analyzed.'.format(n_samples))
5000 samples analyzed.
>>> print("KNNClassifier's performance: {}".format(corrects/n_samples))
KNNClassifier's performance: 0.8776
Methods

fit(self, X, y[, classes, sample_weight])
    Fit the model.
get_info(self)
    Collects and returns information about the configuration of the estimator.
get_params(self[, deep])
    Get parameters for this estimator.
partial_fit(self, X, y[, classes, sample_weight])
    Partially (incrementally) fit the model.
predict(self, X)
    Predict the class label for sample X.
predict_proba(self, X)
    Estimate the probability of X belonging to each class label.
reset(self)
    Reset estimator.
score(self, X, y[, sample_weight])
    Returns the mean accuracy on the given test data and labels.
set_params(self, **params)
    Set the parameters of this estimator.
valid_metrics()
    Get valid distance metrics for the KDTree.
fit(self, X, y[, classes, sample_weight])

Fit the model.

X: The features to train the model.
y: An array-like with the class labels of all samples in X.
classes: Contains all possible/known class labels. Usage varies depending on the learning method.
sample_weight: Samples weight. If not provided, uniform weights are assumed. Usage varies depending on the learning method.

get_info(self)

Collects and returns information about the configuration of the estimator.

Returns: Configuration of the estimator.

get_params(self[, deep])

Get parameters for this estimator.

deep: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns: Parameter names mapped to their values.

partial_fit(self, X, y[, classes, sample_weight])

Partially (incrementally) fit the model.

X: The data upon which the algorithm will create its model.
y: An array-like containing the classification targets for all samples in X.
classes: Array with all possible/known classes.

Returns: self

Notes: For the K-Nearest Neighbors Classifier, fitting the model is equivalent to inserting the newer samples into the observed window and, once the window's size limit is reached, removing the oldest ones. The observed samples are stored in an InstanceWindow object. For this class's documentation, please visit skmultiflow.core.utils.data_structures.
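The windowed fitting behavior described above can be sketched with a bounded deque (a simplified stand-in for the InstanceWindow structure, not the library's implementation): new samples are appended and, once the size limit is reached, the oldest are dropped automatically.

```python
from collections import deque

max_window_size = 3
window = deque(maxlen=max_window_size)  # bounded sliding window

# Appending a fourth sample evicts the oldest one.
for sample in [([0.1], 0), ([0.2], 0), ([0.3], 1), ([0.4], 1)]:
    window.append(sample)

print([x for x, _ in window])  # -> [[0.2], [0.3], [0.4]]
```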
predict(self, X)

Predict the class label for sample X.

X: All the samples we want to predict the label for.

Returns: A 1D array of shape (n_samples,), containing the predicted class labels for all instances in X.

predict_proba(self, X)

Estimate the probability of X belonging to each class label.

Returns: A 2D array of shape (n_samples, n_classes), where each i-th row contains len(self.target_value) elements, representing the probability that the i-th sample of X belongs to a certain class label.
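One common way to obtain such per-class probabilities in a kNN setting is the fraction of the n_neighbors that vote for each class; the helper below is a simplified illustration of that idea, not the library's implementation:

```python
from collections import Counter

def vote_proba(neighbor_labels, classes):
    """Return the fraction of neighbor votes for each class, in class order."""
    votes = Counter(neighbor_labels)
    n = len(neighbor_labels)
    return [votes[c] / n for c in classes]

# 3 of 4 neighbors vote for class 0, 1 of 4 for class 1.
print(vote_proba([0, 0, 1, 0], classes=[0, 1]))  # -> [0.75, 0.25]
```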
score(self, X, y[, sample_weight])

Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that the entire label set for each sample be correctly predicted.

X: Test samples.
y: True labels for X.
sample_weight: Sample weights.

Returns: Mean accuracy of self.predict(X) w.r.t. y.
set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it is possible to update each component of a nested object.
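The <component>__<parameter> routing convention can be illustrated with a toy sketch; the Pipeline and Leaf classes below are hypothetical stand-ins, not skmultiflow or scikit-learn code:

```python
class Leaf:
    """Hypothetical sub-estimator with one tunable parameter."""
    def __init__(self):
        self.n_neighbors = 5

class Pipeline:
    """Hypothetical container that routes nested parameter names."""
    def __init__(self):
        self.knn = Leaf()

    def set_params(self, **params):
        for name, value in params.items():
            # "knn__n_neighbors" -> component "knn", parameter "n_neighbors".
            component, _, parameter = name.partition("__")
            target = getattr(self, component) if parameter else self
            setattr(target, parameter or component, value)
        return self

pipe = Pipeline().set_params(knn__n_neighbors=8)
print(pipe.knn.n_neighbors)  # -> 8
```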