scikit-multiflow
In this example, we will use a data stream to train a HoeffdingTreeClassifier and will measure its performance using prequential evaluation:
Create a stream
By default, the WaveformGenerator generates samples with 21 numeric attributes and 3 target values (classes), based on a random differentiation of some base waveforms:
>>> stream = WaveformGenerator()
Instantiate the Hoeffding Tree classifier
We will use the default parameters.
>>> ht = HoeffdingTreeClassifier()
Set up the evaluator
We will use the EvaluatePrequential class:
>>> evaluator = EvaluatePrequential(show_plot=True,
...                                 pretrain_size=200,
...                                 max_samples=20000)
show_plot=True displays a dynamic plot that is updated as the classifier is trained.
pretrain_size=200 sets the number of samples passed in the first train call.
max_samples=20000 sets the maximum number of samples to use.
Run the evaluation
By calling evaluate(), we pass control to the evaluator, which will perform the following sub-tasks:
Check if there are samples in the stream
Test the classifier (using predict())
Update the classifier (using partial_fit())
Update the evaluation results and plot
>>> evaluator.evaluate(stream=stream, model=ht)
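The sub-tasks above follow a test-then-train pattern, which can be sketched in plain Python. This is a minimal illustration and not the EvaluatePrequential implementation: MajorityClassClassifier and prequential_accuracy are hypothetical names, the stream is a synthetic list of samples, and counting the pretraining samples toward max_samples is an assumption made for this sketch.

```python
import random


class MajorityClassClassifier:
    """Toy incremental model (hypothetical, for illustration only)."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        # Predict the most frequent label seen so far (None before any training).
        if not self.counts:
            return None
        return max(self.counts, key=self.counts.get)

    def partial_fit(self, x, y):
        # Incremental update: only the label counts are kept.
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential_accuracy(samples, model, pretrain_size=0, max_samples=None):
    """Test-then-train loop: each sample is used for testing, then for training."""
    correct = 0
    tested = 0
    for i, (x, y) in enumerate(samples):
        if max_samples is not None and i >= max_samples:
            break
        if i >= pretrain_size:
            # 1. Test on the sample before the model has seen its label.
            correct += int(model.predict(x) == y)
            tested += 1
        # 2. Update the model with the same sample.
        model.partial_fit(x, y)
    return correct / tested if tested else 0.0


rng = random.Random(0)
# Synthetic stream: label 1 appears about 70% of the time.
stream = [((rng.random(),), 1 if rng.random() < 0.7 else 0) for _ in range(2000)]
accuracy = prequential_accuracy(
    stream, MajorityClassClassifier(), pretrain_size=200, max_samples=2000
)
print(f"prequential accuracy: {accuracy:.3f}")
```

Because every sample is tested before it is used for training, the accuracy is always measured on data the model has not yet seen.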
Putting it all together:
>>> from skmultiflow.data import WaveformGenerator
>>> from skmultiflow.trees import HoeffdingTreeClassifier
>>> from skmultiflow.evaluation import EvaluatePrequential
>>>
>>> # 1. Create a stream
>>> stream = WaveformGenerator()
>>>
>>> # 2. Instantiate the HoeffdingTreeClassifier
>>> ht = HoeffdingTreeClassifier()
>>>
>>> # 3. Setup the evaluator
>>> evaluator = EvaluatePrequential(show_plot=True,
...                                 pretrain_size=200,
...                                 max_samples=20000)
>>>
>>> # 4. Run evaluation
>>> evaluator.evaluate(stream=stream, model=ht)
Note: Since we set show_plot=True, a new window will be opened for the plot.
There are cases where we want to use data stored in files. In this example, we will again train a HoeffdingTreeClassifier, but this time we will read the data from a CSV file and write the results of the evaluation to a CSV file.
Load the data set as a stream
For this purpose we will use the FileStream class:
>>> stream = FileStream(filepath)
filepath: a string indicating the path where the data file is located.
The FileStream class will generate a stream using the data contained in the file.
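To make the idea of a file-backed stream concrete, here is a minimal sketch that serves the rows of a CSV file one sample at a time. It is not the FileStream implementation: CsvStreamSketch is a hypothetical class, and the sketch assumes a header row, numeric columns only, and the target in the last column. An in-memory file stands in for a real one.

```python
import csv
import io


class CsvStreamSketch:
    """Hypothetical stand-in for a file-backed stream (illustration only)."""

    def __init__(self, file_obj, target_idx=-1):
        reader = csv.reader(file_obj)
        self.header = next(reader)           # first row holds the column names
        self._rows = reader                  # remaining rows are read lazily
        self._target_idx = target_idx        # assumption: target is the last column
        self._next = next(self._rows, None)  # look ahead by one row

    def has_more_samples(self):
        return self._next is not None

    def next_sample(self):
        row = self._next
        self._next = next(self._rows, None)
        values = [float(v) for v in row]
        y = values.pop(self._target_idx)     # split the target from the features
        return values, y


# In-memory CSV used in place of a file on disk.
data = io.StringIO("a,b,label\n0.1,0.2,0\n0.3,0.4,1\n")
stream = CsvStreamSketch(data)
samples = []
while stream.has_more_samples():
    samples.append(stream.next_sample())
print(samples)
```

Reading the rows lazily means the whole file never has to be held in memory, which is the point of treating it as a stream.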
>>> evaluator = EvaluatePrequential(pretrain_size=1000,
...                                 max_samples=10000,
...                                 output_file='results.csv')
pretrain_size=1000 sets the number of samples passed in the first train call.
max_samples=10000 sets the maximum number of samples to use.
output_file='results.csv' indicates that the results should be stored in a file. In this case, a file named results.csv will be created in the current path.
Pass the next sample to the classifier:
- test the classifier (using predict())
- update the classifier (using partial_fit())
Write results to output_file
When the test finishes, the results.csv file will be available in the current path.
The file contains information related to the test that generated the file. For this example:
# TEST CONFIGURATION BEGIN
# File Stream: filename: elec.csv - n_targets: 1
# [0] HoeffdingTreeClassifier: max_byte_size: 33554432 - memory_estimate_period: 1000000 - grace_period: 200 - split_criterion: info_gain - split_confidence: 1e-07 - tie_threshold: 0.05 - binary_split: False - stop_mem_management: False - remove_poor_atts: False - no_pre_prune: False - leaf_prediction: nba - nb_threshold: 0 - nominal_attributes: [] -
# Prequential Evaluator: n_wait: 200 - max_samples: 10000 - max_time: inf - output_file: results.csv - batch_size: 1 - pretrain_size: 1000 - task_type: classification - show_plot: False - metrics: ['performance', 'kappa']
# TEST CONFIGURATION END
And data related to performance during the evaluation:
id: the id of the sample that was used for testing
global_performance: overall performance (accuracy)
sliding_performance: sliding window performance (accuracy)
global_kappa: overall kappa statistics
sliding_kappa: sliding window kappa statistics
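The difference between the global and the sliding-window columns can be illustrated with a small sketch. The functions below compute accuracy and Cohen's kappa from (true, predicted) pairs, once over the full history and once over only the most recent pairs. This is not the evaluator's own code: the labels are made up, the window size of 4 is chosen only for this example, and returning 0.0 when kappa is undefined is a convention of the sketch.

```python
from collections import Counter, deque


def accuracy(pairs):
    """Fraction of (true, predicted) pairs that agree."""
    return sum(t == p for t, p in pairs) / len(pairs)


def cohen_kappa(pairs):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(pairs)
    p0 = accuracy(pairs)
    true_counts = Counter(t for t, _ in pairs)
    pred_counts = Counter(p for _, p in pairs)
    # Chance agreement from the marginal label frequencies.
    pe = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if pe == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is this sketch's convention
    return (p0 - pe) / (1.0 - pe)


# Made-up labels and predictions, for illustration only.
y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 1, 1, 1, 0, 0]

history = []              # every pair seen so far (global metrics)
window = deque(maxlen=4)  # only the 4 most recent pairs (sliding metrics)
for t, p in zip(y_true, y_pred):
    history.append((t, p))
    window.append((t, p))

global_acc, global_kappa = accuracy(history), cohen_kappa(history)
sliding_acc, sliding_kappa = accuracy(window), cohen_kappa(window)
print(global_acc, global_kappa, sliding_acc, sliding_kappa)
```

In this toy run the two early mistakes stay in the global values (accuracy 0.75, kappa 0.5) but have already left the window, so the sliding values are both 1.0. This is why the sliding columns react faster to recent changes in performance.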
>>> from skmultiflow.data import FileStream
>>> from skmultiflow.trees import HoeffdingTreeClassifier
>>> from skmultiflow.evaluation import EvaluatePrequential
>>>
>>> # 1. Create a stream
>>> stream = FileStream("https://raw.githubusercontent.com/scikit-multiflow/"
...                     "streaming-datasets/master/elec.csv")
>>>
>>> # 2. Instantiate the HoeffdingTreeClassifier
>>> ht = HoeffdingTreeClassifier()
>>>
>>> # 3. Setup the evaluator
>>> evaluator = EvaluatePrequential(pretrain_size=1000,
...                                 max_samples=10000,
...                                 output_file='results.csv')
>>>
>>> # 4. Run evaluation
>>> evaluator.evaluate(stream=stream, model=ht)
Note: The elec.csv file is available in the following repository: https://github.com/scikit-multiflow/streaming-datasets
To avoid downloading the data multiple times, you can keep a local copy and replace the path accordingly.
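One possible way to keep a local copy is a small helper that downloads the file only when it is not already on disk. The helper name local_copy and its fetch parameter are hypothetical and exist only in this sketch; the download itself uses urlretrieve from the standard library. The usage lines are left as comments so that the block runs without network access.

```python
import os
from urllib.request import urlretrieve


def local_copy(url, local_path, fetch=urlretrieve):
    """Download url to local_path only if the file is not already there."""
    if not os.path.exists(local_path):
        fetch(url, local_path)
    return local_path


# Example usage (requires network access on the first call only):
# url = ("https://raw.githubusercontent.com/scikit-multiflow/"
#        "streaming-datasets/master/elec.csv")
# stream = FileStream(local_copy(url, "elec.csv"))
```

The fetch argument makes it possible to swap in another download function, or a stub when testing.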