Introduction

streamscl is a Python package for generating streams from publicly available data. Its benchmark consists of 9 data streams, built from 9 static datasets with real-world out-of-distribution (OOD) or subpopulation data, covering regression, classification, and generation tasks across many modalities: images, text, speech, time series, tabular data, and human-agent interactions.

Motivation

ML practitioners in the wild often face a constant stream of data with an ever-changing distribution. To maintain performance, they must adapt to the data stream, a problem called online continual learning (OCL). Despite its practical relevance, existing benchmarks for OCL suffer from many issues: (1) distribution shifts are abrupt and known beforehand, while in the real world they can also be gradual and arrive without forewarning; (2) shifts are synthetic and unrealistic (e.g., pixel permutations); (3) benchmarks cover only a single modality and task type (typically few-class image classification). To address these issues, we introduce a new multimodal benchmark for OCL called streamscl. Given the scarcity of publicly available data streams, and the potential infrequency of adverse shifts that are worth simulating, we first propose a method to controllably generate streaming data from static data. Then, taking static datasets containing real out-of-domain data (e.g., IWildCam) or multiple subpopulations (e.g., CivilComments), we apply this approach across a variety of modalities – images, text, speech, time series, tabular data, and human-agent interactions – to create streamscl.

Usage

The following datasets are supported:

  • iwildcam

  • civilcomments

  • poverty

  • jeopardy

  • airquality

  • zillow

  • coauthor

  • census

  • nuimages

Data storage

The default data storage folder is ~/.streams_data. You can override it by setting the environment variables DOWNLOAD_HOME and DOWNLOAD_PREFIX. If you do not already have the data downloaded, the STREAMSDataset utilities will do so for you – with the exception of nuimages, which requires manual download.
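As a rough sketch of how the data root can be resolved in your own setup code (the actual precedence between DOWNLOAD_HOME and DOWNLOAD_PREFIX is defined by the package; the helper below is hypothetical and only illustrates the override pattern):

```python
import os
from pathlib import Path

def resolve_data_root() -> Path:
    """Hypothetical helper: pick the data folder, preferring DOWNLOAD_HOME,
    then DOWNLOAD_PREFIX, falling back to the ~/.streams_data default."""
    for var in ("DOWNLOAD_HOME", "DOWNLOAD_PREFIX"):
        value = os.environ.get(var)
        if value:
            return Path(value).expanduser()
    return Path("~/.streams_data").expanduser()

# Example: point the package at a custom location before building datasets
os.environ["DOWNLOAD_HOME"] = "/tmp/my_streams_data"
print(resolve_data_root())
```

Set the environment variable before constructing any STREAMSDataset so that downloads land in the right place.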

To download nuimages, manually download the Metadata and Samples from NuImages. Extract them to a folder named nuimages in the ~/.streams_data or DOWNLOAD_HOME folder. Make sure the directory structure of nuimages looks like:

  • nuimages/
    • samples/
    • v1.0-test/
    • v1.0-train/
    • v1.0-val/
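To sanity-check the layout before constructing the dataset, a small snippet like the following can help (this helper is not part of the package; it only checks for the directory names listed above):

```python
from pathlib import Path

EXPECTED = ("samples", "v1.0-test", "v1.0-train", "v1.0-val")

def check_nuimages_layout(data_root: str) -> list:
    """Return the list of expected nuimages subfolders that are missing."""
    root = Path(data_root).expanduser() / "nuimages"
    return [name for name in EXPECTED if not (root / name).is_dir()]

missing = check_nuimages_layout("~/.streams_data")
if missing:
    print("Missing folders:", missing)
```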

Dataset configuration

To use the STREAMSDataset class, you can either pass in your own stream configuration parameters or use a preset stream configuration. To use the preset configuration, you can do the following:

from streams import STREAMSDataset

dataset_name = "iwildcam"
ds = STREAMSDataset.from_config(dataset_name)

To pass in your own stream configuration parameters, you can do the following:

from streams import STREAMSDataset

dataset_name = "iwildcam"
ds = STREAMSDataset(
   dataset_name,
   T=10,
   gamma=0.5,
   num_peaks=5,
   start_max=10,
   duration=1,
   log_step=1,
   inference_window=1
)

See streams.utils.create_logits() for full parameter descriptions. The inference_window parameter specifies how many steps you can “look ahead” in the stream, serving as a held-out “test set” to evaluate on.
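The actual schedule is produced by streams.utils.create_logits(); purely as an illustration, a peak-based occurrence signal over T steps could be sketched as follows. The semantics below – num_peaks bumps of a given duration, gamma as a geometric decay factor, start_max as the latest possible peak start – are assumptions for the sketch, not the package's actual definitions:

```python
import random

def sketch_occurrence_signal(T=10, gamma=0.5, num_peaks=5, start_max=10,
                             duration=1, seed=0):
    """Illustrative only: build a length-T signal with num_peaks bumps.
    Each bump starts at a random step, holds for `duration` steps, and
    then decays geometrically by a factor of `gamma`."""
    rng = random.Random(seed)
    signal = [0.0] * T
    for _ in range(num_peaks):
        start = rng.randrange(min(start_max, T))
        for t in range(start, T):
            if t < start + duration:
                signal[t] += 1.0  # inside the peak
            else:
                signal[t] += gamma ** (t - start - duration + 1)  # decay tail
    return signal

sig = sketch_occurrence_signal()
probs = [s / sum(sig) for s in sig]  # normalize into sampling probabilities
```

Signals like this, one per domain value, determine how likely each domain is to appear at each timestep of the generated stream.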

Dataset iteration

Any instance of the STREAMSDataset class has a step property that tells you the current timestep in the stream. Initially, step is 0. To iterate through the dataset, you can call the following methods:

from streams import STREAMSDataset

ds = STREAMSDataset.from_config("iwildcam")

train_data, test_data = ds.get_data(include_test=True)
for step, (x, y) in enumerate(train_data):
   print(step, x, y)

# Or load the data into PyTorch data loaders
train_dl, test_dl = ds.get_loaders(batch_size=32, include_test=True)

# Advance time step in the stream
ds.advance(step_size=1)

# Reset to the beginning
ds.reset()

The STREAMSDataset class also has some helper methods:

from streams import STREAMSDataset

ds = STREAMSDataset.from_config("iwildcam")

# Visualize the signals governing how likely domain values are to occur
ds.visualize(domain_type_index=0, domain_value_indices=[0, 1, 2])

# Get data from specific points in the stream
ds.get(step_indices=[4, 5], future_ok=True)

# Get length of dataset
len(ds)

Check out streams.STREAMSDataset for more details.

Avalanche Integration

TODO(shreyashankar)

Training Example

TODO(shreyashankar)

Contributing

TODO(shreyashankar)