Datasets¶

Datasets are the primary mechanism by which PyTorch assembles training and testing data for neural network training. PyTorch already provides a number of handy datasets, and torchvision extends them with common academic sets, but the implementations below add powerful options for loading many kinds of data. We extended the default PyTorch implementation because it does not keep track of some useful metadata. That said, you can use our datasets in the same way as any other PyTorch dataset.

BaseDataset¶

class pywick.datasets.BaseDataset.BaseDataset[source]

Bases: object

An abstract class representing a Dataset.

All other datasets should subclass it. All subclasses should override __len__, which provides the size of the dataset, and __getitem__, which supports integer indexing in the range from 0 to len(self), exclusive.
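The contract described above can be sketched with a small standalone class. This is only an illustration: SquaresDataset is a hypothetical name, and it does not inherit from BaseDataset so that the snippet runs without pywick installed. A real dataset would subclass BaseDataset.

```python
class SquaresDataset:
    """Hypothetical dataset whose i-th sample is the pair (i, i * i)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Size of the dataset.
        return self.size

    def __getitem__(self, index):
        # Integer indexing in the range 0 <= index < len(self).
        if not 0 <= index < len(self):
            raise IndexError(index)
        return index, index * index


dataset = SquaresDataset(4)
samples = [dataset[i] for i in range(len(dataset))]
```

Raising IndexError for out-of-range indices also lets plain Python iteration over the object terminate cleanly.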

fit_transforms()[source]

Make a single pass through the entire dataset in order to fit any parameters of the transforms which require the entire dataset. For example, StandardScaler() requires the mean and std of the entire dataset.

If you don't call this function, transforms that require properties of the entire dataset will work at the batch level instead. For example, StandardScaler() will normalize each batch by that batch's own mean/std.
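The difference between dataset-level and batch-level statistics can be illustrated with plain Python. This is a sketch of the concept only, not of pywick's transform code: the standardize helper and the toy values are assumptions made for the example.

```python
from statistics import mean, pstdev

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
batches = [values[0:4], values[4:8]]


def standardize(batch, mu, sigma):
    # Subtract the mean and divide by the standard deviation.
    return [(v - mu) / sigma for v in batch]


# Fitted once over the whole dataset: every batch shares one mean/std.
full_mu, full_sigma = mean(values), pstdev(values)
fitted = [standardize(b, full_mu, full_sigma) for b in batches]

# Not fitted: each batch is normalized by its own mean/std.
per_batch = [standardize(b, mean(b), pstdev(b)) for b in batches]
```

With batch-level statistics the two batches become indistinguishable after scaling, even though their raw values differ. With dataset-level statistics the first batch stays below zero and the second above it.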

load(num_samples=None, load_range=None)[source]

Load all data or a subset of the data into actual memory. For instance, if the inputs are paths to image files, then this function will actually load those images.

Parameters:
- num_samples – (int (optional)): number of samples to load. If None, all samples are loaded
- load_range – (numpy array of integers (optional)): the index range of images to load, e.g. np.arange(4) loads the first 4 inputs+targets

CSVDataset¶

class pywick.datasets.CSVDataset.CSVDataset(csv, input_cols=None, target_cols=None, input_transform=None, target_transform=None, co_transform=None, apply_transforms_individually=False)[source]

Initialize a Dataset from a CSV file/dataframe. This does NOT actually load the data into memory if the csv parameter contains filepaths.

Parameters:
- csv – (string or pandas.DataFrame): if string, should be a path to a .csv file which can be loaded as a pandas dataframe
- input_cols – (list of ints, or list of strings): which column(s) to use as input arrays. If int(s), should be column indices. If str(s), should be column names
- target_cols – (list of ints, or list of strings): which column(s) to use as target arrays. If int(s), should be column indices. If str(s), should be column names
- input_transform – (transform): transform to apply to inputs during runtime loading
- target_transform – (transform): transform to apply to targets during runtime loading
- co_transform – (transform): transform to apply to both inputs and targets simultaneously during runtime loading
- apply_transforms_individually – (bool): whether to apply transforms to individual inputs or to an input row as a whole (default: False)
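Selecting columns either by index or by name, as input_cols and target_cols allow, can be sketched with the standard csv module. The helper names resolve_columns and select, and the sample data, are hypothetical; CSVDataset itself works on a pandas dataframe.

```python
import csv
import io

csv_text = "path,width,label\na.png,32,cat\nb.png,64,dog\n"
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]


def resolve_columns(cols, header):
    # Integers are treated as column indices, strings as column names.
    return [c if isinstance(c, int) else header.index(c) for c in cols]


def select(records, cols, header):
    idx = resolve_columns(cols, header)
    return [[row[i] for i in idx] for row in records]


inputs = select(records, [0, "width"], header)
targets = select(records, ["label"], header)
```

Note that the csv module returns every field as a string, so numeric columns would still need converting.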
copy(df=None)[source]

Creates a copy of itself (including transforms and other params).

Parameters: df – dataframe to include in the copy. If not specified, uses the internal dataframe inside this instance (if any)
split_by_column(col)[source]

Split this dataset object into multiple dataset objects based on the unique values of the given column. The number of returned datasets equals the number of unique values in that column. The transforms and original dataframe are all transferred to the new datasets.

Useful for splitting a dataset into train/val/test datasets.

Parameters: col – (integer or string): which column to split the data on. If int, should be column index. If str, should be column name
Returns: list of new datasets with transforms copied
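The grouping behind a column-based split can be sketched as follows. This is a minimal illustration on a list of dictionaries rather than a dataframe. The function name split_rows_by_column and the sample rows are assumptions, and the order of the returned groups (first appearance) is a choice made for this sketch.

```python
from collections import defaultdict

rows = [
    {"path": "a.png", "split": "train"},
    {"path": "b.png", "split": "val"},
    {"path": "c.png", "split": "train"},
    {"path": "d.png", "split": "test"},
]


def split_rows_by_column(rows, col):
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row)
    # One group per unique value, in order of first appearance.
    return list(groups.values())


parts = split_rows_by_column(rows, "split")
```

Three unique values in the split column give three groups, which matches the train/val/test use case mentioned above.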
train_test_split(train_size)[source]

Define a split for the current dataset where some part of it is used for training while the remainder is used for testing.

Parameters: train_size – (int): length of the training dataset. The remainder will be returned as the test dataset
Returns: tuple of datasets (train, test)

ClonedFolderDataset¶

class pywick.datasets.ClonedFolderDataset.ClonedFolderDataset(data, meta_data, **kwargs)[source]

Dataset that can be initialized with a dictionary of internal parameters (useful when trying to clone a FolderDataset)

Parameters:
- data – (list): list of data on which the dataset operates
- meta_data – (dict): parameters that correspond to the target dataset’s attributes
- kwargs – (args): variable set of key-value pairs to set as attributes for the dataset
pywick.datasets.ClonedFolderDataset.random_split_dataset(orig_dataset, splitRatio=0.8, random_seed=None)[source]

Randomly split the given dataset into two datasets based on the provided ratio

Parameters:
- orig_dataset – (UsefulDataset): dataset to split (of type pywick.datasets.UsefulDataset)
- splitRatio – (float): ratio to use when splitting the data
- random_seed – (int): random seed for replicability of results
Returns: tuple of split ClonedFolderDatasets
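A seeded ratio split can be sketched on a plain list as shown below. The shuffle-then-cut approach and the random_split helper are assumptions made for illustration. The actual function returns ClonedFolderDatasets and may choose its samples differently.

```python
import random


def random_split(items, split_ratio=0.8, random_seed=None):
    # A dedicated generator keeps the split reproducible for a given seed.
    rng = random.Random(random_seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    cut = int(len(items) * split_ratio)
    first = [items[i] for i in indices[:cut]]
    second = [items[i] for i in indices[cut:]]
    return first, second


data = list(range(10))
train, val = random_split(data, split_ratio=0.8, random_seed=42)
```

Calling the helper again with the same seed returns the same two lists, which is the point of exposing random_seed.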

FolderDataset¶

class pywick.datasets.FolderDataset.FolderDataset(root, class_mode='label', class_to_idx=None, input_regex='*', rel_target_root='', target_prefix='', target_postfix='', target_extension='png', transform=None, target_transform=None, co_transform=None, apply_co_transform_first=True, default_loader='pil', target_loader=None, exclusion_file=None, target_index_map=None)[source]

First, the relevant directory structures are traversed to find all necessary files.

Then the provided loader(s) are invoked on inputs and targets.

Finally, the provided transforms are applied, with the option to specify the order of individual and co-transforms.

The rel_target_root parameter is used for image segmentation cases.

Typically the structure will look like the following:

|- root (aka training images)
|  |- dir1
|  |- dir2
|- (target images, located via rel_target_root)
|  |- dir1
|  |- dir2
Parameters:
- root – (string): path to main directory
- class_mode – (string in {‘label’, ‘image’, ‘path’}): type of target sample to look for and return
  - label = return class folder as target
  - image = return another image as target (determined by optional target_prefix/postfix). NOTE: if class_mode == ‘image’, in addition to input, you must also provide rel_target_root, target_prefix or target_postfix (in any combination).
  - path = determines paths for inputs and targets and applies the respective loaders to the path
- class_to_idx – (dict): If specified, the given class_to_idx map will be used. Otherwise one will be derived from the directory structure.
- input_regex – (string (default is any valid image file)): regular expression to find input images, e.g. if all your inputs have the word ‘input’, you’d enter something like input_regex='input'
- rel_target_root – (string (default is Nothing)): root of directory where to look for target images RELATIVE to the root dir (first arg)
- target_prefix – (string (default is Nothing)): prefix to use (if any) when trying to locate the matching target
- target_postfix – (string): postfix to use (if any) when trying to locate the matching target
- transform – (torch transform): transform to apply to input sample individually
- target_transform – (torch transform): transform to apply to target sample individually
- co_transform – (torch transform): transform to apply to both the input and the target
- apply_co_transform_first – (bool): whether to apply the co-transform before or after individual transforms (default: True = before)
- default_loader – (string in {‘npy’, ‘pil’} or function (default: pil)): defines how to load samples from file. Will be applied to both input and target unless a separate target_loader is defined. If a function is provided, it should take in a file path as input and return the loaded sample.
- target_loader – (string in {‘npy’, ‘pil’} or function (default: pil)): defines how to load target samples from file. If a function is provided, it should take in a file path as input and return the loaded sample.
- exclusion_file – (string): list of files to exclude when enumerating all files. The list must be a full path relative to the root parameter
- target_index_map – (dict (defaults to binary mask: {255:1})): a dictionary that maps pixel values in the image to classes to be recognized. Used in conjunction with ‘image’ class_mode to produce a label for semantic segmentation. For semantic segmentation this is required, so the default is a binary mask. However, if you want to turn off this feature, specify target_index_map=None
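When class_to_idx is not supplied, it is derived from the directory structure. One plausible way to do that is sketched below. The rule used here (one class per immediate subdirectory, indices assigned in sorted order) and the helper name derive_class_to_idx are assumptions, not a description of the FolderDataset internals.

```python
import os
import tempfile


def derive_class_to_idx(root):
    # Assumption: each immediate subdirectory of root is one class,
    # and indices follow the sorted directory names.
    classes = sorted(
        entry.name for entry in os.scandir(root) if entry.is_dir()
    )
    return {name: idx for idx, name in enumerate(classes)}


with tempfile.TemporaryDirectory() as root:
    for name in ("dog", "cat", "bird"):
        os.makedirs(os.path.join(root, name))
    class_to_idx = derive_class_to_idx(root)
```

Passing an explicit class_to_idx is useful when the mapping must stay stable across datasets whose folders differ.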
getdata()[source]

Data that the Dataset class operates on. Typically iterable/list of tuple(label,target). Note: This is different than simply calling myDataset.data because some datasets are comprised of multiple other datasets! The dataset returned should be the combined dataset!

Returns: iterable - Representation of the entire dataset (combined if necessary from multiple other datasets)
getmeta_data()[source]

Additional data to return that might be useful to consumer. Typically a dict.

Returns: dict(any)

MultiFolderDataset¶

class pywick.datasets.MultiFolderDataset.MultiFolderDataset(roots, class_mode='label', class_to_idx=None, input_regex='*', rel_target_root='', target_prefix='', target_postfix='', target_extension='png', transform=None, target_transform=None, co_transform=None, apply_co_transform_first=True, default_loader='pil', target_loader=None, exclusion_file=None, target_index_map=None)[source]

This class extends FolderDataset with the ability to supply multiple root directories. The rel_target_root must exist relative to each root directory. For a complete description of functionality, see FolderDataset.

Parameters:
- roots – (list): list of root directories to traverse
- class_mode – (string in {‘label’, ‘image’, ‘path’}): type of target sample to look for and return
  - label = return class folder as target
  - image = return another image as target (determined by optional target_prefix/postfix). NOTE: if class_mode == ‘image’, in addition to input, you must also provide rel_target_root, target_prefix or target_postfix (in any combination).
  - path = determines paths for inputs and targets and applies the respective loaders to the path
- class_to_idx – (dict): If specified, the given class_to_idx map will be used. Otherwise one will be derived from the directory structure.
- input_regex – (string (default is any valid image file)): regular expression to find input images, e.g. if all your inputs have the word ‘input’, you’d enter something like input_regex='input'
- rel_target_root – (string (default is Nothing)): root of directory where to look for target images RELATIVE to the root dir (first arg)
- target_prefix – (string (default is Nothing)): prefix to use (if any) when trying to locate the matching target
- target_postfix – (string): postfix to use (if any) when trying to locate the matching target
- transform – (torch transform): transform to apply to input sample individually
- target_transform – (torch transform): transform to apply to target sample individually
- co_transform – (torch transform): transform to apply to both the input and the target
- apply_co_transform_first – (bool): whether to apply the co-transform before or after individual transforms (default: True = before)
- default_loader – (string in {‘npy’, ‘pil’} or function (default: pil)): defines how to load samples from file. Will be applied to both input and target unless a separate target_loader is defined. If a function is provided, it should take in a file path as input and return the loaded sample.
- target_loader – (string in {‘npy’, ‘pil’} or function (default: pil)): defines how to load target samples from file. If a function is provided, it should take in a file path as input and return the loaded sample.
- exclusion_file – (string): list of files to exclude when enumerating all files. The list must be a full path relative to the root parameter
- target_index_map – (dict (defaults to binary mask: {255:1})): a dictionary that maps pixel values in the image to classes to be recognized. Used in conjunction with ‘image’ class_mode to produce a label for semantic segmentation. For semantic segmentation this is required, so the default is a binary mask. However, if you want to turn off this feature, specify target_index_map=None
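The role of target_index_map can be illustrated on a tiny mask stored as nested lists. This is a sketch of the mapping idea only. The helper apply_index_map is hypothetical, and sending unmapped pixel values to 0 is an assumption made for the example.

```python
def apply_index_map(mask, index_map, default=0):
    # Pixel values missing from index_map fall back to `default`
    # (an assumption made for this sketch).
    return [[index_map.get(pixel, default) for pixel in row] for row in mask]


mask = [
    [0, 255, 255],
    [0, 0, 255],
]
labels = apply_index_map(mask, {255: 1})
```

With the binary map {255: 1}, white pixels become class 1 and everything else stays at class 0. A multi-class segmentation mask would simply use a larger dictionary.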

PredictFolderDataset¶

class pywick.datasets.PredictFolderDataset.PredictFolderDataset(root, input_regex='*', input_transform=None, input_loader=<function <lambda>>, target_loader=None, exclusion_file=None)[source]

If not transformed in any way (either via one of the loaders or transforms), the inputs and targets will be identical (paths to the discovered files).

Instead, the intended use is that the input path is loaded into some kind of binary representation (usually an image), while the target is either left as a path or is post-processed to accommodate some special need.

Parameters:
- root – (string): path to main directory
- input_regex – (string (default is any valid image file)): regular expression to find inputs, e.g. if all your inputs have the word ‘input’, you’d enter something like input_regex='input'
- input_transform – (torch transform): transform to apply to each input before returning
- input_loader – (callable (default: identity)): defines how to load input samples from file. If a function is provided, it should take in a file path as input and return the loaded sample. Identity simply returns the input.
- target_loader – (callable (default: None)): defines how to load target samples from file (which, in our case, are the same as inputs). If a function is provided, it should take in a file path as input and return the loaded sample.
- exclusion_file – (string): list of files to exclude when enumerating all files. The list must be a full path relative to the root parameter
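The interplay of the identity input loader and the optional target loader can be sketched on a plain list of paths. Everything here is illustrative: make_samples and fake_image_loader are hypothetical names, and file discovery is left out entirely.

```python
import os


def identity(path):
    # Default loader: hand the path back unchanged.
    return path


def fake_image_loader(path):
    # Stand-in for a real image loader; returns a placeholder object.
    return {"pixels_from": path}


def make_samples(paths, input_loader=identity, target_loader=None):
    samples = []
    for path in paths:
        # Without a target loader, the target stays the raw path.
        target = target_loader(path) if target_loader is not None else path
        samples.append((input_loader(path), target))
    return samples


paths = ["images/a.png", "images/b.png"]
untouched = make_samples(paths)
loaded = make_samples(paths, fake_image_loader, os.path.basename)
```

In the second call the input is a loaded object, while the target keeps just the file name. That is handy for writing predictions back out under matching names.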

TensorDataset¶

class pywick.datasets.TensorDataset.TensorDataset(inputs, targets=None, input_transform=None, target_transform=None, co_transform=None)[source]

Parameters:
- inputs – (numpy array)
- targets – (numpy array)
- input_transform – (transform): transform to apply to input sample individually
- target_transform – (transform): transform to apply to target sample individually
- co_transform – (transform): transform to apply to both input and target sample simultaneously
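How individual transforms and a co-transform combine can be sketched with a small standalone class. PairDataset is a hypothetical stand-in that uses lists instead of numpy arrays. Applying the individual transforms before the co-transform is an assumption of this sketch, not a statement about TensorDataset.

```python
class PairDataset:
    """Hypothetical stand-in for a dataset holding inputs and targets."""

    def __init__(self, inputs, targets, input_transform=None,
                 target_transform=None, co_transform=None):
        self.inputs = inputs
        self.targets = targets
        self.input_transform = input_transform
        self.target_transform = target_transform
        self.co_transform = co_transform

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        x, y = self.inputs[index], self.targets[index]
        if self.input_transform is not None:
            x = self.input_transform(x)      # input only
        if self.target_transform is not None:
            y = self.target_transform(y)     # target only
        if self.co_transform is not None:
            x, y = self.co_transform(x, y)   # both, kept in sync
        return x, y


def double(values):
    return [v * 2 for v in values]


def flip_both(x, y):
    return x[::-1], y[::-1]


dataset = PairDataset(
    inputs=[[1, 2, 3], [4, 5, 6]],
    targets=[[0, 0, 1], [1, 0, 0]],
    input_transform=double,
    co_transform=flip_both,
)
first_input, first_target = dataset[0]
```

A co-transform is the right place for operations such as flips or crops, where input and target must change together.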

UsefulDataset¶

class pywick.datasets.UsefulDataset.UsefulDataset[source]

Bases: torch.utils.data.Dataset

A torch.utils.data.Dataset class with additional useful functions.

getdata()[source]

Data that the Dataset class operates on. Typically iterable/list of tuple(label,target). Note: This is different than simply calling myDataset.data because some datasets are comprised of multiple other datasets! The dataset returned should be the combined dataset!

Returns: iterable - Representation of the entire dataset (combined if necessary from multiple other datasets)
getmeta_data()[source]

Additional data to return that might be useful to consumer. Typically a dict.

Returns: dict(any)

Data utilities¶

pywick.datasets.data_utils.adjust_dset_length(dataset, num_batches: int, num_devices: int, batch_size: int)[source]
To properly distribute computation across devices (typically GPUs) we need to meet two criteria:
1. batch size on each device must be > 1
2. dataset must be evenly partitioned across devices in specified batches
Parameters:
- dataset – Dataset to trim
- num_batches – Number of batches that the dataset will be partitioned into
- num_devices – Number of devices the dataset will be distributed onto
- batch_size – Size of individual batch
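One way to satisfy the second criterion is to trim the dataset length to a whole number of multi-device steps. The sketch below shows only that arithmetic, under the assumption that batch_size is the per-device batch size. It ignores num_batches and is not taken from the adjust_dset_length implementation. The name trimmed_length is hypothetical.

```python
def trimmed_length(dataset_len, num_devices, batch_size):
    # Assumption: batch_size is the per-device batch size, so one full
    # step across all devices consumes num_devices * batch_size samples.
    step = num_devices * batch_size
    return (dataset_len // step) * step


new_len = trimmed_length(1003, num_devices=4, batch_size=16)
```

For 1003 samples on 4 devices with 16 samples each, one step uses 64 samples, so 15 full steps (960 samples) are kept and 43 samples are dropped.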
pywick.datasets.data_utils.get_dataset_mean_std(data_set, img_size=256, output_div=255.0)[source]

Computes channel-wise mean and std of the dataset. The process is memory-intensive as the entire dataset must fit into memory. Therefore, each image is scaled down to img_size first (default: 256).

Assumptions:
1. dataset uses PIL to read images
2. Images are in RGB format.
Parameters:
- data_set – (pytorch Dataset)
- img_size – (int): scale of images at which to compute mean/std (default: 256)
- output_div – (float {1.0, 255.0}): Image values are naturally in 0-255 value range so the returned output is divided by output_div. For example, if output_div = 255.0 then mean/std will be in 0-1 range.
Returns: (mean, std) as per-channel values ([r,g,b], [r,g,b])
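The per-channel computation and the effect of output_div can be sketched on a handful of RGB pixels. This is an illustration, not the library routine: channel_mean_std is a hypothetical helper, it skips the resizing step, and it uses the population standard deviation, which is an assumption.

```python
from statistics import mean, pstdev


def channel_mean_std(pixels, output_div=255.0):
    # pixels: sequence of (r, g, b) tuples with values in the 0-255 range.
    channels = list(zip(*pixels))
    means = [mean(c) / output_div for c in channels]
    stds = [pstdev(c) / output_div for c in channels]
    return means, stds


pixels = [(0, 255, 51), (255, 255, 51), (0, 255, 51), (255, 255, 51)]
means, stds = channel_mean_std(pixels)
raw_means, raw_stds = channel_mean_std(pixels, output_div=1.0)
```

With output_div=255.0 the results land in the 0-1 range expected by most normalization transforms. With output_div=1.0 they stay in the original 0-255 range.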
pywick.datasets.data_utils.npy_loader(path, color_space=None)[source]

Convenience loader for numeric files (e.g. arrays of numbers)

pywick.datasets.data_utils.pil_loader(path, color_space='')[source]

Attempts to load a file using PIL with provided color_space.

Parameters:
- path – (string): file to load
- color_space – (string, one of {rgb, rgba, L, 1, binary}): Specifies the colorspace to use for PIL loading. If not provided, a simple Image.open(path) will be performed.
Returns: PIL image
pywick.datasets.data_utils.pil_loader_bw(path)[source]

Convenience loader for B/W files (e.g. .png with only one color channel)

pywick.datasets.data_utils.pil_loader_rgb(path)[source]

Convenience loader for RGB files (e.g. .jpg)