Stream generators are a cheap source of data: because samples are generated on demand, there is no need to store the data physically. scikit-multiflow provides multiple stream generators, and all of them work in a similar way.
Here, we will use the AGRAWALGenerator to show how to use generators within scikit-multiflow.
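To make the "generated on demand" idea concrete, here is a minimal sketch of a toy stream that computes each sample only when it is requested. The class `ToyStream`, its label rule, and its parameters are hypothetical and exist only for illustration; this is not how scikit-multiflow implements its generators, it merely mimics the three methods covered in this guide.

```python
import random


class ToyStream:
    """Hypothetical stand-in for a stream generator; not scikit-multiflow code."""

    def __init__(self, n_features=3, random_state=1):
        self.n_features = n_features
        self.random_state = random_state
        self.restart()

    def restart(self):
        # Re-seeding returns the stream to its initial state.
        self._rng = random.Random(self.random_state)

    def has_more_samples(self):
        # Samples are computed on request, so this toy stream never runs out.
        return True

    def next_sample(self, batch_size=1):
        # Build each sample only when it is requested; nothing is stored.
        X = [[self._rng.random() for _ in range(self.n_features)]
             for _ in range(batch_size)]
        # Label rule chosen only for illustration.
        y = [int(sum(row) > self.n_features / 2) for row in X]
        return X, y


stream = ToyStream()
X, y = stream.next_sample(5)
print(len(X), len(X[0]), len(y))  # 5 3 5
```

Because the toy stream keeps only a seed and a random number generator, its memory footprint does not grow with the number of samples drawn.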
Instantiate the Stream generator
>>> from skmultiflow.data import AGRAWALGenerator
>>> generator = AGRAWALGenerator()
Get data from the stream
Use next_sample() to obtain data (samples) from any Stream object. The Stream returns the requested number of samples (n_samples) as two arrays: X for the features and y for the classes (classification) or targets (regression).
>>> X, y = generator.next_sample()
>>> print(X.shape, y.shape)
(1, 9) (1,)
By default, next_sample() returns one sample, but we can pass an arbitrary number of samples as next_sample(n_samples). For example, to get 1000 samples:
>>> X, y = generator.next_sample(1000)
>>> print(X.shape, y.shape)
(1000, 9) (1000,)
Check if the stream has more data
When working with streams, it is important to know if there is more data remaining. You can use has_more_samples() to query the Stream for this information.
>>> generator.has_more_samples()
True
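A common pattern is to pull data in batches and check has_more_samples() before every request. The sketch below assumes only that the stream object exposes next_sample() and has_more_samples() as described above; the helper `consume` and the class `FiniteToyStream` are hypothetical names introduced for illustration, and the toy stream is finite so that the check actually ends the loop.

```python
def consume(stream, max_samples, batch_size=100):
    """Pull up to max_samples from a stream-like object, batch by batch."""
    n_seen = 0
    while n_seen < max_samples and stream.has_more_samples():
        # Never request more than the remaining budget.
        n_request = min(batch_size, max_samples - n_seen)
        X, y = stream.next_sample(n_request)
        if len(X) == 0:
            break  # defensive stop if the stream returns nothing
        n_seen += len(X)
    return n_seen


class FiniteToyStream:
    """Hypothetical finite stream used only to exercise the loop."""

    def __init__(self, n_total):
        self.n_total = n_total
        self.position = 0

    def has_more_samples(self):
        return self.position < self.n_total

    def next_sample(self, batch_size=1):
        # Return at most batch_size samples, fewer near the end of the stream.
        n = min(batch_size, self.n_total - self.position)
        X = [[float(self.position + i)] for i in range(n)]
        y = [(self.position + i) % 2 for i in range(n)]
        self.position += n
        return X, y


print(consume(FiniteToyStream(250), max_samples=1000))  # 250
print(consume(FiniteToyStream(250), max_samples=120))   # 120
```

The max_samples budget matters when the source never runs dry: without it, a loop that relies only on has_more_samples() would not terminate on a stream that always reports True.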
Restart the stream
To return a Stream object to its initial state, use restart().
>>> generator.restart()
Save the data into a csv file [Optional]
There might be cases where we want to store the data obtained from a Stream generator. An easy way to do this is with numpy and pandas. First, we concatenate the X and y arrays into a single np.array. Then we create a DataFrame, which is easy to manipulate if we want to, for example, name the features or pre-process the data.
>>> df = pd.DataFrame(np.hstack((X,np.array([y]).T)))
Finally, to write the data into a csv:
>>> df.to_csv("file.csv")
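The following sketch walks through the same concatenate-then-save steps with named columns. It uses stand-in arrays with made-up values in place of real generator output, writes to an in-memory buffer instead of file.csv, and the column names are hypothetical placeholders. It also passes index=False, since to_csv() writes the row index as an extra first column by default.

```python
import io

import numpy as np
import pandas as pd

# Stand-in arrays with the same shapes as the 1000-sample example above
# (made-up values, not real generator output).
rng = np.random.default_rng(0)
X = rng.random((1000, 9))
y = rng.integers(0, 2, size=1000)

# column_stack appends y as the last column; for a 1-D y this gives the
# same result as np.hstack((X, np.array([y]).T)).
data = np.column_stack((X, y))

# Hypothetical column names chosen for illustration.
columns = [f"feature_{i}" for i in range(X.shape[1])] + ["target"]
df = pd.DataFrame(data, columns=columns)

# index=False keeps the row index out of the output.
buffer = io.StringIO()  # in-memory stand-in for "file.csv"
df.to_csv(buffer, index=False)

print(df.shape)                           # (1000, 10)
print(buffer.getvalue().splitlines()[0])  # header row with the column names
```

Naming the columns up front makes the file self-describing when it is read back with pd.read_csv().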
Putting it all together:
>>> from skmultiflow.data import AGRAWALGenerator
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # 1. Instantiate the stream generator
>>> generator = AGRAWALGenerator()
>>>
>>> # 2. Get data from the stream
>>> X, y = generator.next_sample()
>>> print(X.shape, y.shape)
(1, 9) (1,)
>>>
>>> X, y = generator.next_sample(1000)
>>> print(X.shape, y.shape)
(1000, 9) (1000,)
>>>
>>> # 3. Check if the stream has more data
>>> generator.has_more_samples()
True
>>>
>>> # 4. Restart the stream
>>> generator.restart()
>>>
>>> # 5. Save data into a csv file [Optional]
>>> df = pd.DataFrame(np.hstack((X, np.array([y]).T)))
>>> df.to_csv("file.csv")