Datasets

Datasets are the primary mechanism by which PyTorch assembles training and testing data for use while training neural networks. While PyTorch already provides a number of handy datasets, and torchvision further extends them to common academic sets, the implementations below provide some very powerful options for loading all kinds of data. We had to extend the default PyTorch implementation because, by default, it does not keep track of some useful metadata. That said, you can use our datasets in the normal fashion you’re already used to with PyTorch.
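
For example, any of the dataset classes described below can be wrapped in a standard torch.utils.data.DataLoader. A minimal sketch (my_dataset is a hypothetical instance of one of the classes on this page):

    from torch.utils.data import DataLoader

    # my_dataset is a hypothetical instance of any pywick dataset class below
    loader = DataLoader(my_dataset, batch_size=32, shuffle=True, num_workers=4)

    for inputs, targets in loader:
        ...  # train / evaluate as usual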

BaseDataset

class pywick.datasets.BaseDataset.BaseDataset[source]

Bases: object

An abstract class representing a Dataset.

All other datasets should subclass it. All subclasses should override __len__, which provides the size of the dataset, and __getitem__, which supports integer indexing in the range from 0 to len(self) exclusive.

fit_transforms()[source]

Make a single pass through the entire dataset in order to fit any parameters of the transforms which require the entire dataset. e.g. StandardScaler() requires mean and std for the entire dataset.

If you don’t call this fit function, then transforms which require properties of the entire dataset will simply work at the batch level, e.g. StandardScaler() will normalize each batch by that batch’s mean/std.

load(num_samples=None, load_range=None)[source]

Load all data or a subset of the data into actual memory. For instance, if the inputs are paths to image files, then this function will actually load those images.

Parameters:
  • num_samples – (int (optional)): number of samples to load. If None, all samples will be loaded
  • load_range – (numpy array of integers (optional)): the index range of images to load e.g. np.arange(4) loads the first 4 inputs+targets
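
As a quick sketch (ds is a hypothetical instance of a concrete subclass such as FolderDataset; only the two documented methods are used):

    import numpy as np

    # ds is a hypothetical concrete BaseDataset subclass (e.g. FolderDataset)
    ds.load(load_range=np.arange(4))  # pre-load only the first 4 inputs+targets
    ds.fit_transforms()               # fit dataset-wide transform parameters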

CSVDataset

class pywick.datasets.CSVDataset.CSVDataset(csv, input_cols=None, target_cols=None, input_transform=None, target_transform=None, co_transform=None, apply_transforms_individually=False)[source]

Bases: pywick.datasets.BaseDataset.BaseDataset

Initialize a Dataset from a CSV file/dataframe. This does NOT actually load the data into memory if the csv parameter contains filepaths.

Parameters:
  • csv – (string or pandas.DataFrame): if string, should be a path to a .csv file which can be loaded as a pandas dataframe
  • input_cols – (list of ints, or list of strings): which column(s) to use as input arrays. If int(s), should be column indices. If str(s), should be column names
  • target_cols – (list of ints, or list of strings): which column(s) to use as target arrays. If int(s), should be column indices. If str(s), should be column names
  • input_transform – (transform): transform to apply to inputs during runtime loading
  • target_transform – (transform): transform to apply to targets during runtime loading
  • co_transform – (transform): transform to apply to both inputs and targets simultaneously during runtime loading
  • apply_transforms_individually – (bool): Whether to apply transforms to individual inputs or to an input row as a whole (default: False)
copy(df=None)[source]

Creates a copy of itself (including transforms and other params).

Parameters:df – dataframe to include in the copy. If not specified, uses the internal dataframe inside this instance (if any)
Returns:a new CSVDataset copy with the same transforms and parameters
split_by_column(col)[source]

Split this dataset object into multiple dataset objects based on the unique values of the given column. The number of returned datasets will equal the number of unique values in the given column. The transforms and the original dataframe are all transferred to the new datasets.

Useful for splitting a dataset into train/val/test datasets.

Parameters:col – (integer or string) which column to split the data on. If int, should be a column index. If str, should be a column name
Returns:list of new datasets with transforms copied
train_test_split(train_size)[source]

Define a split for the current dataset where some part of it is used for training while the remainder is used for testing

Parameters:train_size – (int): length of the training dataset. The remainder will be returned as the test dataset
Returns:tuple of datasets (train, test)
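
A minimal sketch of constructing and splitting a CSVDataset (the file name train.csv and the column names are hypothetical):

    from pywick.datasets.CSVDataset import CSVDataset

    # hypothetical CSV with columns: image_path, label, fold
    dataset = CSVDataset('train.csv',
                         input_cols=['image_path'],
                         target_cols=['label'])

    # reserve 80% of the rows for training, the remainder for testing
    train_ds, test_ds = dataset.train_test_split(train_size=int(0.8 * len(dataset)))

    # or: one dataset per unique value of the 'fold' column
    fold_datasets = dataset.split_by_column('fold')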

ClonedFolderDataset

class pywick.datasets.ClonedFolderDataset.ClonedFolderDataset(data, meta_data, **kwargs)[source]

Bases: pywick.datasets.FolderDataset.FolderDataset

Dataset that can be initialized with a dictionary of internal parameters (useful when trying to clone a FolderDataset)

Parameters:
  • data – (list): list of data on which the dataset operates
  • meta_data – (dict): parameters that correspond to the target dataset’s attributes
  • kwargs – (args): variable set of key-value pairs to set as attributes for the dataset
pywick.datasets.ClonedFolderDataset.random_split_dataset(orig_dataset, splitRatio=0.8, random_seed=None)[source]

Randomly split the given dataset into two datasets based on the provided ratio

Parameters:
  • orig_dataset – (UsefulDataset): dataset to split (of type pywick.datasets.UsefulDataset)
  • splitRatio – (float): ratio to use when splitting the data
  • random_seed – (int): random seed for replicability of results
Returns:

tuple of split ClonedFolderDatasets
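
For example, an existing FolderDataset (or any other UsefulDataset) can be split roughly 80/20 as follows (folder_ds is hypothetical):

    from pywick.datasets.ClonedFolderDataset import random_split_dataset

    # folder_ds is a hypothetical, already-constructed UsefulDataset (e.g. FolderDataset)
    train_ds, val_ds = random_split_dataset(folder_ds, splitRatio=0.8, random_seed=42)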

FolderDataset

class pywick.datasets.FolderDataset.FolderDataset(root, class_mode='label', class_to_idx=None, input_regex='*', rel_target_root='', target_prefix='', target_postfix='', target_extension='png', transform=None, target_transform=None, co_transform=None, apply_co_transform_first=True, default_loader='pil', target_loader=None, exclusion_file=None, target_index_map=None)[source]

Bases: pywick.datasets.UsefulDataset.UsefulDataset

An incredibly versatile dataset class for loading out-of-memory data.

First, the relevant directory structures are traversed to find all necessary files.

Then the provided loader(s) are invoked on the inputs and targets.

Finally, the provided transforms are applied, with the option to specify the order of individual and co-transforms.

The rel_target_root parameter is used for image segmentation cases.

Typically the structure will look like the following:

|- root (aka training images)
    - dir1
    - dir2
|- masks (aka label images)
    - dir1
    - dir2
Parameters:
  • root – (string): path to main directory
  • class_mode

    (string in {‘label’, ‘image’, ‘path’}): type of target sample to look for and return

    label = return class folder as target

    image = return another image as target (determined by optional target_prefix/postfix). NOTE: if class_mode == ‘image’, in addition to input, you must also provide rel_target_root, target_prefix or target_postfix (in any combination).

    path = determines paths for inputs and targets and applies the respective loaders to the path

  • class_to_idx – (dict): If specified, the given class_to_idx map will be used. Otherwise one will be derived from the directory structure.
  • input_regex – (string (default is any valid image file)): regular expression to find input images. e.g. if all your inputs have the word ‘input’, you’d enter something like input_regex='*input*'
  • rel_target_root – (string (default is Nothing)): root of directory where to look for target images RELATIVE to the root dir (first arg)
  • target_prefix – (string (default is Nothing)): prefix to use (if any) when trying to locate the matching target
  • target_postfix – (string): postfix to use (if any) when trying to locate the matching target
  • transform – (torch transform): transform to apply to input sample individually
  • target_transform – (torch transform): transform to apply to target sample individually
  • co_transform – (torch transform): transform to apply to both the input and the target
  • apply_co_transform_first – (bool): whether to apply the co-transform before or after individual transforms (default: True = before)
  • default_loader – (string in {‘npy’, ‘pil’} or function (default: pil)): defines how to load samples from file. Will be applied to both input and target unless a separate target_loader is defined. if a function is provided, it should take in a file path as input and return the loaded sample.
  • target_loader – (string in {‘npy’, ‘pil’} or function (default: pil)): defines how to load target samples from file. If a function is provided, it should take in a file path as input and return the loaded sample.
  • exclusion_file – (string): list of files to exclude when enumerating all files. The list must be a full path relative to the root parameter
  • target_index_map

    (dict (defaults to binary mask: {255:1})): a dictionary that maps pixel values in the image to classes to be recognized.

    Used in conjunction with the ‘image’ class_mode to produce a label for semantic segmentation. For semantic segmentation this is required, so the default is a binary mask. However, if you want to turn off this feature, specify target_index_map=None

getdata()[source]

Data that the Dataset class operates on. Typically iterable/list of tuple(label,target). Note: This is different than simply calling myDataset.data because some datasets are comprised of multiple other datasets! The dataset returned should be the combined dataset!

Returns:iterable - Representation of the entire dataset (combined if necessary from multiple other datasets)
getmeta_data()[source]

Additional data to return that might be useful to consumer. Typically a dict.

Returns:dict(any)
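
As a sketch of the two class_mode variants described above (all paths, the relative masks location, and the target extension are hypothetical):

    from pywick.datasets.FolderDataset import FolderDataset

    # classification: the class folder itself becomes the target
    clf_ds = FolderDataset(root='/data/train', class_mode='label')

    # segmentation: a mask image, located relative to root, becomes the target
    seg_ds = FolderDataset(root='/data/train',
                           class_mode='image',
                           rel_target_root='../masks',
                           target_extension='png',
                           target_index_map={255: 1})  # pixel value 255 -> class 1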

MultiFolderDataset

class pywick.datasets.MultiFolderDataset.MultiFolderDataset(roots, class_mode='label', class_to_idx=None, input_regex='*', rel_target_root='', target_prefix='', target_postfix='', target_extension='png', transform=None, target_transform=None, co_transform=None, apply_co_transform_first=True, default_loader='pil', target_loader=None, exclusion_file=None, target_index_map=None)[source]

Bases: pywick.datasets.FolderDataset.FolderDataset

This class extends FolderDataset with the ability to supply multiple root directories. The rel_target_root must exist relative to each root directory. For a complete description of functionality, see FolderDataset.

Parameters:
  • roots – (list): list of root directories to traverse
  • class_mode

    (string in {‘label’, ‘image’, ‘path’}): type of target sample to look for and return

    label = return class folder as target

    image = return another image as target (determined by optional target_prefix/postfix)

    NOTE: if class_mode == ‘image’, in addition to input, you must also provide rel_target_root, target_prefix or target_postfix (in any combination).

    path = determines paths for inputs and targets and applies the respective loaders to the path

  • class_to_idx – (dict): If specified, the given class_to_idx map will be used. Otherwise one will be derived from the directory structure.
  • input_regex

    (string (default is any valid image file)): regular expression to find input images

    e.g. if all your inputs have the word ‘input’, you’d enter something like input_regex='*input*'

  • rel_target_root – (string (default is Nothing)): root of directory where to look for target images RELATIVE to the root dir (first arg)
  • target_prefix – (string (default is Nothing)): prefix to use (if any) when trying to locate the matching target
  • target_postfix – (string): postfix to use (if any) when trying to locate the matching target
  • transform – (torch transform): transform to apply to input sample individually
  • target_transform – (torch transform): transform to apply to target sample individually
  • co_transform – (torch transform): transform to apply to both the input and the target
  • apply_co_transform_first – (bool): whether to apply the co-transform before or after individual transforms (default: True = before)
  • default_loader

    (string in {‘npy’, ‘pil’} or function (default: pil)): defines how to load samples from file. Will be applied to both input and target unless a separate target_loader is defined.

    if a function is provided, it should take in a file path as input and return the loaded sample.

  • target_loader

    (string in {‘npy’, ‘pil’} or function (default: pil)): defines how to load target samples from file

    if a function is provided, it should take in a file path as input and return the loaded sample.

  • exclusion_file – (string): list of files to exclude when enumerating all files. The list must be a full path relative to the root parameter
  • target_index_map

    (dict (defaults to binary mask: {255:1})): a dictionary that maps pixel values in the image to classes to be recognized.

    Used in conjunction with the ‘image’ class_mode to produce a label for semantic segmentation. For semantic segmentation this is required, so the default is a binary mask. However, if you want to turn off this feature, specify target_index_map=None
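
A short sketch with two root directories (both paths are hypothetical):

    from pywick.datasets.MultiFolderDataset import MultiFolderDataset

    # rel_target_root (if used) must exist under each of these hypothetical roots
    dataset = MultiFolderDataset(roots=['/data/batch1', '/data/batch2'],
                                 class_mode='label')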

PredictFolderDataset

class pywick.datasets.PredictFolderDataset.PredictFolderDataset(root, input_regex='*', input_transform=None, input_loader=<function <lambda>>, target_loader=None, exclusion_file=None)[source]

Bases: pywick.datasets.FolderDataset.FolderDataset

Convenience class for loading out-of-memory data that is more geared toward prediction data loading (where ground truth is not available).

If not transformed in any way (either via one of the loaders or transforms), the inputs and targets will be identical (paths to the discovered files).

Instead, the intended use is that the input path is loaded into some kind of binary representation (usually an image), while the target is either left as a path or is post-processed to accommodate some special need.

Parameters:
  • root – (string): path to main directory
  • input_regex – (string (default is any valid image file)): regular expression to find inputs. e.g. if all your inputs have the word ‘input’, you’d enter something like input_regex='*input*'
  • input_transform – (torch transform): transform to apply to each input before returning
  • input_loader – (callable (default: identity)): defines how to load input samples from file. If a function is provided, it should take in a file path as input and return the loaded sample. Identity simply returns the input.
  • target_loader – (callable (default: None)): defines how to load target samples from file (which, in our case, are the same as inputs). If a function is provided, it should take in a file path as input and return the loaded sample.
  • exclusion_file – (string): list of files to exclude when enumerating all files. The list must be a full path relative to the root parameter
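
A minimal prediction-time sketch (the /data/predict path is hypothetical; pil_loader_rgb is one of the convenience loaders documented in the data utilities section below):

    from pywick.datasets.PredictFolderDataset import PredictFolderDataset
    from pywick.datasets.data_utils import pil_loader_rgb

    # inputs are loaded as RGB images; targets remain the discovered file paths
    dataset = PredictFolderDataset(root='/data/predict',
                                   input_loader=pil_loader_rgb)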

TensorDataset

class pywick.datasets.TensorDataset.TensorDataset(inputs, targets=None, input_transform=None, target_transform=None, co_transform=None)[source]

Bases: pywick.datasets.BaseDataset.BaseDataset

Dataset class for loading in-memory data.

Parameters:
  • inputs – (numpy array)
  • targets – (numpy array)
  • input_transform – (transform): transform to apply to input sample individually
  • target_transform – (transform): transform to apply to target sample individually
  • co_transform – (transform): transform to apply to both input and target sample simultaneously
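
A small, self-contained sketch with random numpy data (the normalization lambda is just an illustrative callable):

    import numpy as np
    from pywick.datasets.TensorDataset import TensorDataset

    inputs = np.random.rand(100, 3, 32, 32).astype(np.float32)
    targets = np.random.randint(0, 10, size=100)

    dataset = TensorDataset(inputs, targets,
                            input_transform=lambda x: (x - 0.5) / 0.5)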

UsefulDataset

class pywick.datasets.UsefulDataset.UsefulDataset[source]

Bases: torch.utils.data.Dataset

A torch.utils.data.Dataset class with additional useful functions.

getdata()[source]

Data that the Dataset class operates on. Typically iterable/list of tuple(label,target). Note: This is different than simply calling myDataset.data because some datasets are comprised of multiple other datasets! The dataset returned should be the combined dataset!

Returns:iterable - Representation of the entire dataset (combined if necessary from multiple other datasets)
getmeta_data()[source]

Additional data to return that might be useful to consumer. Typically a dict.

Returns:dict(any)

data utilities

pywick.datasets.data_utils.adjust_dset_length(dataset, num_batches: int, num_devices: int, batch_size: int)[source]
To properly distribute computation across devices (typically GPUs) we need to meet two criteria:
  1. batch size on each device must be > 1
  2. dataset must be evenly partitioned across devices in specified batches
Parameters:
  • dataset – Dataset to trim
  • num_batches – Number of batches the dataset will be partitioned into
  • num_devices – Number of devices the dataset will be distributed onto
  • batch_size – Size of individual batch
Returns:

pywick.datasets.data_utils.get_dataset_mean_std(data_set, img_size=256, output_div=255.0)[source]

Computes channel-wise mean and std of the dataset. The process is memory-intensive as the entire dataset must fit into memory. Therefore, each image is scaled down to img_size first (default: 256).

Assumptions:
  1. dataset uses PIL to read images
  2. Images are in RGB format.
Parameters:
  • data_set – (pytorch Dataset)
  • img_size – (int): scale of images at which to compute mean/std (default: 256)
  • output_div – (float {1.0, 255.0}): Image values are naturally in 0-255 value range so the returned output is divided by output_div. For example, if output_div = 255.0 then mean/std will be in 0-1 range.
Returns:

(mean, std) as per-channel values ([r,g,b], [r,g,b])
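
For example (some_dataset is a hypothetical PIL-backed dataset of RGB images):

    from pywick.datasets.data_utils import get_dataset_mean_std

    # per-channel [r, g, b] values, scaled to 0-1 because output_div=255.0
    mean, std = get_dataset_mean_std(some_dataset, img_size=256, output_div=255.0)
    # e.g. suitable for torchvision.transforms.Normalize(mean, std)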

pywick.datasets.data_utils.npy_loader(path, color_space=None)[source]

Convenience loader for numeric files (e.g. arrays of numbers)

pywick.datasets.data_utils.pil_loader(path, color_space='')[source]

Attempts to load a file using PIL with provided color_space.

Parameters:
  • path – (string): file to load
  • color_space – (string, one of {rgb, rgba, L, 1, binary}): Specifies the colorspace to use for PIL loading. If not provided a simple Image.open(path) will be performed.
Returns:

PIL image

pywick.datasets.data_utils.pil_loader_bw(path)[source]

Convenience loader for B/W files (e.g. .png with only one color channel)

pywick.datasets.data_utils.pil_loader_rgb(path)[source]

Convenience loader for RGB files (e.g. .jpg)
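
A short sketch of the loaders in isolation (all file names are hypothetical); any of them can also be passed to FolderDataset via default_loader / target_loader:

    from pywick.datasets.data_utils import npy_loader, pil_loader, pil_loader_bw, pil_loader_rgb

    img  = pil_loader('photo.jpg', color_space='rgb')  # explicit colorspace
    rgb  = pil_loader_rgb('photo.jpg')                 # convenience RGB loader
    mask = pil_loader_bw('mask.png')                   # single-channel (B/W) image
    arr  = npy_loader('features.npy')                  # numeric array from disk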