skmultiflow.data.FileStream

class skmultiflow.data.FileStream(filepath, target_idx=-1, n_targets=1, cat_features=None, allow_nan=False)[source]

Creates a stream from a file source.

Currently only CSV files are supported. The goal is to support other formats, provided there is a function that correctly reads and interprets the file and returns a pandas DataFrame or numpy.ndarray with the data.

Parameters
filepath:

Path to the data file

target_idx: int, optional (default=-1)

The column index from which the targets start.

n_targets: int, optional (default=1)

The number of targets.

cat_features: list, optional (default=None)

A list of indices corresponding to the location of categorical features.

allow_nan: bool, optional (default=False)

If True, allows NaN values in the data. Otherwise, an error is raised.
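
As an illustration of how target_idx and n_targets relate to the columns of the file, here is a minimal sketch in plain Python. The helper split_row is hypothetical and is not part of skmultiflow. It assumes the targets are the n_targets consecutive columns starting at target_idx and that every other column is a feature; the library's own loading code may differ in detail.

```python
def split_row(row, target_idx=-1, n_targets=1):
    """Split one row into (features, targets).

    Hypothetical helper, not the library's code: targets are the
    n_targets columns starting at target_idx; the rest are features.
    """
    n_cols = len(row)
    # Normalise a negative index so that -1 means "last column".
    start = target_idx if target_idx >= 0 else n_cols + target_idx
    stop = start + n_targets
    if start < 0 or stop > n_cols:
        raise ValueError("target columns fall outside the row")
    targets = row[start:stop]
    features = row[:start] + row[stop:]
    return features, targets


row = [0.080429, 8.397187, 7.074928, 0]
print(split_row(row))                               # last column is the target
print(split_row(row, target_idx=-2, n_targets=2))   # last two columns are targets
```

With the defaults, the last column is the single target, which matches the shape of the output in the Examples section (three features and one label per sample).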

Notes

On request, the stream object provides a number of samples in such a way that old samples cannot be accessed at a later time. This correctly simulates the stream context.
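
The forward-only behaviour in this note can be sketched with a small stand-in class. ForwardOnlyStream below is hypothetical and uses only the standard library. It borrows the method names next_sample, n_remaining_samples, has_more_samples and restart from this page, but it is not the FileStream implementation, and it returns plain lists instead of numpy arrays.

```python
import csv
import io


class ForwardOnlyStream:
    """Hypothetical stand-in that mimics forward-only sample access."""

    def __init__(self, csv_text):
        rows = list(csv.reader(io.StringIO(csv_text)))
        # Skip the header row; assume the last column is the target.
        self._rows = [[float(v) for v in r] for r in rows[1:]]
        self._cursor = 0

    def next_sample(self, batch_size=1):
        # Hand out the next rows and move the cursor past them.
        batch = self._rows[self._cursor:self._cursor + batch_size]
        self._cursor += len(batch)
        X = [r[:-1] for r in batch]
        y = [int(r[-1]) for r in batch]
        return X, y

    def n_remaining_samples(self):
        return len(self._rows) - self._cursor

    def has_more_samples(self):
        return self.n_remaining_samples() > 0

    def restart(self):
        # The only way back to earlier samples is to start over.
        self._cursor = 0


CSV_TEXT = "a,b,label\n1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n"
stream = ForwardOnlyStream(CSV_TEXT)
first = stream.next_sample()
second = stream.next_sample()
print(first, second, stream.n_remaining_samples())
```

There is no method for going back to a sample that has already been returned. In this sketch, as in the class documented here, restarting the stream is the only way to see earlier samples again.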

Examples

>>> # Imports
>>> from skmultiflow.data.file_stream import FileStream
>>> # Setup the stream
>>> stream = FileStream("https://raw.githubusercontent.com/scikit-multiflow/"
...                     "streaming-datasets/master/sea_stream.csv")
>>> # Retrieving one sample
>>> stream.next_sample()
(array([[0.080429, 8.397187, 7.074928]]), array([0]))
>>> # Retrieving 10 samples
>>> stream.next_sample(10)
(array([[1.42074 , 7.504724, 6.764101],
        [0.960543, 5.168416, 8.298959],
        [3.367279, 6.797711, 4.857875],
        [9.265933, 8.548432, 2.460325],
        [7.295862, 2.373183, 3.427656],
        [9.289001, 3.280215, 3.154171],
        [0.279599, 7.340643, 3.729721],
        [4.387696, 1.97443 , 6.447183],
        [2.933823, 7.150514, 2.566901],
        [4.303049, 1.471813, 9.078151]]),
 array([0, 0, 1, 1, 1, 1, 0, 0, 1, 0]))
>>> stream.n_remaining_samples()
39989
>>> stream.has_more_samples()
True

Methods

get_all_samples(self)

Returns all the samples in the stream.

get_data_info(self)

Retrieves minimum information from the stream.

get_info(self)

Collects and returns the information about the configuration of the estimator.

get_params(self[, deep])

Get parameters for this estimator.

get_target_values(self)

has_more_samples(self)

Checks if stream has more samples.

is_restartable(self)

Determine if the stream is restartable.

last_sample(self)

Retrieves last batch_size samples in the stream.

n_remaining_samples(self)

Returns the estimated number of remaining samples.

next_sample(self[, batch_size])

Returns next sample from the stream.

prepare_for_use()

Prepare the stream for use.

reset(self)

Resets the estimator to its initial state.

restart(self)

Restarts the stream.

set_params(self, **params)

Set the parameters of this estimator.

Attributes

cat_features_idx

Get the list of the categorical features index.

feature_names

Retrieve the names of the features.

n_cat_features

Retrieve the number of categorical features.

n_features

Retrieve the number of features.

n_num_features

Retrieve the number of numerical features.

n_targets

Get the number of targets.

target_idx

Get the number of the column where Y begins.

target_names

Retrieve the names of the targets.

target_values

Retrieve all target_values in the stream for each target.

property cat_features_idx

Get the list of the categorical features index.

Returns
list

List of categorical features index.

property feature_names

Retrieve the names of the features.

Returns
list

The names of the features.

get_all_samples(self)[source]

Returns all the samples in the stream.

Returns
X: pd.DataFrame

The features’ columns.

y: pd.DataFrame

The targets’ columns.

get_data_info(self)[source]

Retrieves minimum information from the stream.

Used by evaluator methods to identify the stream.

The default format is: ‘Stream name - n_targets, n_classes, n_features’.

Returns
string

Stream data information.

get_info(self)[source]

Collects and returns the information about the configuration of the estimator.

Returns
string

Configuration of the estimator.

get_params(self, deep=True)[source]

Get parameters for this estimator.

Parameters
deep: boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: mapping of string to any

Parameter names mapped to their values.

has_more_samples(self)[source]

Checks if stream has more samples.

Returns
bool

True if stream has more samples.

is_restartable(self)[source]

Determine if the stream is restartable.

Returns
bool

True if stream is restartable.

last_sample(self)[source]

Retrieves last batch_size samples in the stream.

Returns
tuple or tuple list

A numpy.ndarray of shape (batch_size, n_features) and an array-like of shape (batch_size, n_targets), representing the last batch_size samples retrieved.

property n_cat_features

Retrieve the number of categorical features.

Returns
int

The number of categorical features in the stream.

property n_features

Retrieve the number of features.

Returns
int

The total number of features.

property n_num_features

Retrieve the number of numerical features.

Returns
int

The number of numerical features in the stream.
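
The three feature counts can be related with a short sketch. The helper feature_counts is hypothetical. It assumes that every feature not listed in cat_features is numerical, so that n_num_features equals n_features minus n_cat_features; treat that relation as an assumption rather than a documented guarantee.

```python
def feature_counts(n_features, cat_features=None):
    """Derive categorical and numerical feature counts.

    Hypothetical helper: assumes any feature whose index is not in
    cat_features is numerical.
    """
    cat_idx = sorted(set(cat_features or []))
    if any(i < 0 or i >= n_features for i in cat_idx):
        raise ValueError("categorical index outside the feature range")
    n_cat = len(cat_idx)
    return {
        "n_features": n_features,
        "n_cat_features": n_cat,
        "n_num_features": n_features - n_cat,
    }


print(feature_counts(3))                       # no categorical features
print(feature_counts(5, cat_features=[0, 3]))  # two categorical, three numerical
```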

n_remaining_samples(self)[source]

Returns the estimated number of remaining samples.

Returns
int

Remaining number of samples.

property n_targets

Get the number of targets.

Returns
int

The number of targets.

next_sample(self, batch_size=1)[source]

Returns next sample from the stream.

If there are enough instances to supply at least batch_size samples, those are returned. Otherwise, a tuple of (None, None) is returned.

Parameters
batch_size: int (optional, default=1)

The number of instances to return.

Returns
tuple or tuple list

Returns the next batch_size instances. For general purposes the return can be treated as a numpy.ndarray.

static prepare_for_use()[source]

Prepare the stream for use.

Deprecated in v0.5.0 and will be removed in v0.7.0.

reset(self)[source]

Resets the estimator to its initial state.

Returns
self

restart(self)[source]

Restarts the stream.

It reinitializes the stream to its initial state.

set_params(self, **params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns
self
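
The <component>__<parameter> naming convention can be illustrated with a small sketch. split_nested_param is a hypothetical helper, and the parameter names used below are made up for illustration. The sketch shows only how such a name separates at the first double underscore, not how set_params applies the value.

```python
def split_nested_param(name):
    """Split '<component>__<parameter>' into (component, parameter).

    Hypothetical helper; returns (None, name) for a plain parameter.
    """
    component, sep, parameter = name.partition("__")
    if not sep:
        return None, name
    return component, parameter


print(split_nested_param("allow_nan"))
print(split_nested_param("stream__allow_nan"))
print(split_nested_param("pipe__stream__allow_nan"))
```

Splitting at the first separator leaves the rest of the name intact, so a deeper component can split it again in the same way.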

property target_idx

Get the number of the column where Y begins.

Returns
int

The number of the column where Y begins.

property target_names

Retrieve the names of the targets.

Returns
list

The names of the targets in the stream.

property target_values

Retrieve all target_values in the stream for each target.

Returns
list

List of lists of all target_values for each target.
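
To make the "list of lists" shape concrete, here is a minimal sketch that collects the distinct values of each target column. collect_target_values is a hypothetical helper. The sorted ordering and the use of plain Python lists are assumptions of this sketch and may not match what the property returns.

```python
def collect_target_values(y_rows):
    """Return, for each target column, the sorted distinct values it takes.

    Hypothetical helper: y_rows is a list of rows, one entry per target.
    """
    if not y_rows:
        return []
    n_targets = len(y_rows[0])
    return [sorted({row[t] for row in y_rows}) for t in range(n_targets)]


y_rows = [[0, "a"], [1, "b"], [0, "a"], [1, "c"]]
print(collect_target_values(y_rows))
```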