Introduction

streamscl is a Python package for generating streams from publicly available data. Its benchmark consists of 9 data streams, built from 9 static datasets with real-world out-of-distribution (OOD) or subpopulation data, covering regression, classification, and generation tasks across many modalities: images, text, speech, time series, tabular data, and human-agent interactions.

Motivation

ML practitioners in the wild often face a constant stream of data with an ever-changing distribution. To maintain performance, they must adapt to the data stream, a problem called online continual learning (OCL). Despite its practical relevance, existing benchmarks for OCL suffer from many issues: (1) distribution shifts are abrupt and known beforehand, while in the real world they can also be gradual and arrive without forewarning; (2) shifts are synthetic and unrealistic (e.g., pixel permutations); (3) benchmarks cover only a single modality and task type (typically few-class image classification). To address these issues, we introduce a new multimodal benchmark for OCL called streamscl. Given the scarcity of publicly available data streams, and the potential infrequency of adverse shifts that are worth simulating, we first propose a method to controllably generate streaming data from static data. Then, taking static datasets containing real out-of-domain data (e.g., IWildCam) or multiple subpopulations (e.g., CivilComments), we apply this approach across a variety of modalities – images, text, speech, time series, tabular data, and human-agent interactions – to create streamscl.

Usage

The following datasets are supported:

  • iwildcam

  • civilcomments

  • poverty

  • jeopardy

  • airquality

  • zillow

  • coauthor

  • census

  • nuimages

Data storage

The default data storage folder is ~/.streams_data. You can override it by setting the environment variables DOWNLOAD_HOME and DOWNLOAD_PREFIX. If you do not already have the data downloaded, the STREAMSDataset utilities will do so for you – with the exception of nuimages, which requires manual download.
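As a rough sketch of how the data root can be resolved in your own setup code (the actual precedence between DOWNLOAD_HOME and DOWNLOAD_PREFIX is defined by the package; the helper below is hypothetical and only illustrates the override pattern):

```python
import os
from pathlib import Path

def resolve_data_root() -> Path:
    """Hypothetical helper: pick the data folder, preferring DOWNLOAD_HOME,
    then DOWNLOAD_PREFIX, falling back to the ~/.streams_data default."""
    for var in ("DOWNLOAD_HOME", "DOWNLOAD_PREFIX"):
        value = os.environ.get(var)
        if value:
            return Path(value).expanduser()
    return Path("~/.streams_data").expanduser()

# Example: point the package at a custom location before building datasets
os.environ["DOWNLOAD_HOME"] = "/tmp/my_streams_data"
print(resolve_data_root())
```

Set the environment variable before constructing any STREAMSDataset so that downloads land in the right place.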

To download nuimages, manually download the Metadata and Samples from NuImages. Extract them to a folder named nuimages in the ~/.streams_data or DOWNLOAD_HOME folder. Make sure the directory structure of nuimages looks like:

  • nuimages/
    • samples/
    • v1.0-test/
    • v1.0-train/
    • v1.0-val/
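To sanity-check the layout before constructing the dataset, a small snippet like the following can help (this helper is not part of the package; it only checks for the directory names listed above):

```python
from pathlib import Path

EXPECTED = ("samples", "v1.0-test", "v1.0-train", "v1.0-val")

def check_nuimages_layout(data_root: str) -> list:
    """Return the list of expected nuimages subfolders that are missing."""
    root = Path(data_root).expanduser() / "nuimages"
    return [name for name in EXPECTED if not (root / name).is_dir()]

missing = check_nuimages_layout("~/.streams_data")
if missing:
    print("Missing folders:", missing)
```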

Dataset configuration

To use the STREAMSDataset class, you can either pass in your own stream configuration parameters or use a preset stream configuration. To use the preset configuration, you can do the following:

from streams import STREAMSDataset

dataset_name = "iwildcam"
ds = STREAMSDataset.from_config(dataset_name)

To pass in your own stream configuration parameters, you can do the following:

from streams import STREAMSDataset

dataset_name = "iwildcam"
ds = STREAMSDataset(
   dataset_name,
   T=10,
   gamma=0.5,
   num_peaks=5,
   start_max=10,
   duration=1,
   log_step=1,
   inference_window=1
)

See streams.utils.create_logits() for full parameter descriptions. The inference_window parameter specifies how many steps you can “look ahead” in the stream, serving as a held-out “test set” to evaluate on.
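The actual schedule is produced by streams.utils.create_logits(); purely as an illustration, a peak-based occurrence signal over T steps could be sketched as follows. The semantics below – num_peaks bumps of a given duration, gamma as a geometric decay factor, start_max as the latest possible peak start – are assumptions for the sketch, not the package's actual definitions:

```python
import random

def sketch_occurrence_signal(T=10, gamma=0.5, num_peaks=5, start_max=10,
                             duration=1, seed=0):
    """Illustrative only: build a length-T signal with num_peaks bumps.
    Each bump starts at a random step, holds for `duration` steps, and
    then decays geometrically by a factor of `gamma`."""
    rng = random.Random(seed)
    signal = [0.0] * T
    for _ in range(num_peaks):
        start = rng.randrange(min(start_max, T))
        for t in range(start, T):
            if t < start + duration:
                signal[t] += 1.0  # inside the peak
            else:
                signal[t] += gamma ** (t - start - duration + 1)  # decay tail
    return signal

sig = sketch_occurrence_signal()
probs = [s / sum(sig) for s in sig]  # normalize into sampling probabilities
```

Signals like this, one per domain value, determine how likely each domain is to appear at each timestep of the generated stream.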

Dataset iteration

Any instance of the STREAMSDataset class has a step property that tells you the current timestep in the stream. Initially, step is 0. To iterate through the dataset, you can call the following methods:

from streams import STREAMSDataset

ds = STREAMSDataset.from_config("iwildcam")

train_data, test_data = ds.get_data(include_test=True)
for step, (x, y) in enumerate(train_data):
   print(step, x, y)

# Or load the data into PyTorch data loaders
train_dl, test_dl = ds.get_loaders(batch_size=32, include_test=True)

# Advance time step in the stream
ds.advance(step_size=1)

# Reset to the beginning
ds.reset()

The STREAMSDataset class also has some helper methods:

from streams import STREAMSDataset

ds = STREAMSDataset.from_config("iwildcam")

# Visualize the signals governing how likely domain values are to occur
ds.visualize(domain_type_index=0, domain_value_indices=[0, 1, 2])

# Get data from specific points in the stream
ds.get(step_indices=[4, 5], future_ok=True)

# Get length of dataset
len(ds)

Check out streams.STREAMSDataset for more details.

Avalanche Integration

TODO(shreyashankar)

Training Example

TODO(shreyashankar)

Contributing

TODO(shreyashankar)