Datasets¶
Datasets are the primary mechanism by which PyTorch assembles training and testing data for neural networks. PyTorch already provides a number of handy datasets, and torchvision extends them to common academic sets; the implementations below add flexible options for loading many kinds of data. We extended the default PyTorch implementation because it does not keep track of some useful metadata. That said, you can use our datasets in the normal fashion you’re used to with PyTorch.
BaseDataset¶
-
class
pywick.datasets.BaseDataset.
BaseDataset
[source]¶ Bases:
object
An abstract class representing a Dataset.
All other datasets should subclass it. All subclasses should override
__len__
, which provides the size of the dataset, and
__getitem__
, which supports integer indexing in the range from 0 to len(self) exclusive.
-
fit_transforms
()[source]¶ Make a single pass through the entire dataset in order to fit any parameters of the transforms which require the entire dataset. e.g. StandardScaler() requires mean and std for the entire dataset.
If you don't call this fit function, transforms that require properties of the entire dataset will work at the batch level instead, e.g. StandardScaler() will normalize each batch by that batch's mean/std.
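The difference between fitting on the whole dataset and falling back to batch-level statistics can be illustrated with a small standalone sketch. It does not use pywick: the `standardize` helper and the sample values are hypothetical and only mimic what a StandardScaler-style transform computes.

```python
from statistics import mean, pstdev

def standardize(values, mu, sigma):
    """Scale values to zero mean / unit variance given fitted statistics."""
    return [(v - mu) / sigma for v in values]

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
batches = [data[:4], data[4:]]

# Fitted once over the entire dataset (what a full pass would provide).
global_mu, global_sigma = mean(data), pstdev(data)
fitted = [standardize(b, global_mu, global_sigma) for b in batches]

# Not fitted: every batch is normalized by its own mean/std.
per_batch = [standardize(b, mean(b), pstdev(b)) for b in batches]
```

With global statistics the first batch stays entirely below zero, whereas per-batch normalization makes both batches look alike and hides the offset between them.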
-
load
(num_samples=None, load_range=None)[source]¶ Load all data or a subset of the data into actual memory. For instance, if the inputs are paths to image files, then this function will actually load those images.
Parameters: - num_samples – (int (optional)): number of samples to load. If None, all samples are loaded
- load_range – (numpy array of integers (optional)): the indices of the samples to load, e.g. np.arange(4) loads the first 4 inputs+targets
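The subclassing contract described for BaseDataset (override `__len__` and `__getitem__`) can be sketched as follows. To stay runnable without pywick installed, the example uses a plain class; `SquaresDataset` is a hypothetical name, and a real subclass would presumably derive from `pywick.datasets.BaseDataset.BaseDataset`.

```python
class SquaresDataset:
    """Toy dataset honoring the __len__/__getitem__ contract (hypothetical)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        # Valid indices run from 0 to len(self) - 1.
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index, index ** 2  # (input, target)

dataset = SquaresDataset(5)
samples = [dataset[i] for i in range(len(dataset))]
```

Raising `IndexError` past the end also lets plain `for sample in dataset` loops terminate cleanly.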
CSVDataset¶
-
class
pywick.datasets.CSVDataset.
CSVDataset
(csv, input_cols=None, target_cols=None, input_transform=None, target_transform=None, co_transform=None, apply_transforms_individually=False)[source]¶ Bases:
pywick.datasets.BaseDataset.BaseDataset
Initialize a Dataset from a CSV file/dataframe. This does NOT actually load the data into memory if the
csv
parameter contains filepaths.
Parameters: - csv – (string or pandas.DataFrame): if string, should be a path to a .csv file which can be loaded as a pandas dataframe
- input_cols – (list of ints, or list of strings): which column(s) to use as input arrays. If int(s), should be column indices. If str(s), should be column names
- target_cols – (list of ints, or list of strings): which column(s) to use as target arrays. If int(s), should be column indices. If str(s), should be column names
- input_transform – (transform): transform to apply to inputs during runtime loading
- target_transform – (transform): transform to apply to targets during runtime loading
- co_transform – (transform): transform to apply to both inputs and targets simultaneously during runtime loading
- apply_transforms_individually – (bool): Whether to apply transforms to individual inputs or to an input row as a whole (default: False)
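The way `input_cols` and `target_cols` accept either integer indices or column names can be illustrated with a standard-library sketch. It is not pywick's implementation: `select_columns` and the CSV contents are hypothetical.

```python
import csv
import io

CSV_TEXT = "path,width,label\nimg_0.png,32,cat\nimg_1.png,64,dog\n"

def select_columns(row, header, cols):
    """Pick columns by integer index or by column name."""
    idx = [c if isinstance(c, int) else header.index(c) for c in cols]
    return [row[i] for i in idx]

rows = list(csv.reader(io.StringIO(CSV_TEXT)))
header, body = rows[0], rows[1:]

inputs = [select_columns(r, header, ["path"]) for r in body]   # by name
targets = [select_columns(r, header, [2]) for r in body]       # by index
```

Note that the inputs here remain file paths; nothing is read from disk until a loader is applied, which matches the lazy behavior described above.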
-
copy
(df=None)[source]¶ Creates a copy of itself (including transforms and other params).
Parameters: df – dataframe to include in the copy. If not specified, uses the internal dataframe inside this instance (if any) Returns:
-
split_by_column
(col)[source]¶ Split this dataset object into multiple dataset objects based on the unique factors of the given column. The number of returned datasets will be equal to the number of unique values in the given column. The transforms and original dataframe will all be transferred to the new datasets
Useful for splitting a dataset into train/val/test datasets.
Parameters: col – (integer or string) which column to split the data on. if int, should be column index. if str, should be column name Returns: list of new datasets with transforms copied
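The idea behind splitting on the unique values of a column can be sketched with plain Python. This is an illustration only: the `records` data is made up, and the helper returns a dict keyed by column value, whereas the method documented above returns a list of datasets.

```python
from collections import defaultdict

records = [
    {"path": "a.png", "split": "train"},
    {"path": "b.png", "split": "val"},
    {"path": "c.png", "split": "train"},
    {"path": "d.png", "split": "test"},
]

def split_by_column(rows, col):
    """Group rows by the unique values found in `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row)
    return dict(groups)

parts = split_by_column(records, "split")
```

Three unique values in the `split` column yield three groups, one per value.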
-
train_test_split
(train_size)[source]¶ Define a split for the current dataset where some part of it is used for training while the remainder is used for testing
Parameters: train_size – (int): length of the training dataset. The remainder will be returned as the test dataset Returns: tuple of datasets (train, test)
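A size-based split like the one described can be sketched as below. Whether pywick shuffles before splitting is not stated here, so this hypothetical helper assumes it does not and simply cuts the sequence at `train_size`.

```python
def train_test_split(rows, train_size):
    """First `train_size` rows become train, the rest test (no shuffling assumed)."""
    if not 0 <= train_size <= len(rows):
        raise ValueError("train_size must lie between 0 and len(rows)")
    return rows[:train_size], rows[train_size:]

train, test = train_test_split(list(range(10)), train_size=7)
```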
ClonedFolderDataset¶
-
class
pywick.datasets.ClonedFolderDataset.
ClonedFolderDataset
(data, meta_data, **kwargs)[source]¶ Bases:
pywick.datasets.FolderDataset.FolderDataset
Dataset that can be initialized with a dictionary of internal parameters (useful when trying to clone a FolderDataset)
Parameters: - data – (list): list of data on which the dataset operates
- meta_data – (dict): parameters that correspond to the target dataset’s attributes
- kwargs – (args): variable set of key-value pairs to set as attributes for the dataset
-
pywick.datasets.ClonedFolderDataset.
random_split_dataset
(orig_dataset, splitRatio=0.8, random_seed=None)[source]¶ Randomly split the given dataset into two datasets based on the provided ratio
Parameters: - orig_dataset – (UsefulDataset): dataset to split (of type pywick.datasets.UsefulDataset)
- splitRatio – (float): ratio to use when splitting the data
- random_seed – (int): random seed for replicability of results
Returns: tuple of split ClonedFolderDatasets
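A seeded ratio split can be sketched in a few lines of standard-library Python. This is not the pywick implementation: `random_split` is a hypothetical helper that operates on a plain list rather than on a dataset object, and rounding the cut point down is an assumption.

```python
import random

def random_split(items, split_ratio=0.8, random_seed=None):
    """Shuffle a copy of `items` and cut it at `split_ratio`."""
    rng = random.Random(random_seed)   # local RNG keeps global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]

first, second = random_split(range(10), split_ratio=0.8, random_seed=42)
again, _ = random_split(range(10), split_ratio=0.8, random_seed=42)
```

Passing the same seed reproduces the same partition, which is the point of the `random_seed` parameter.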
FolderDataset¶
-
class
pywick.datasets.FolderDataset.
FolderDataset
(root, class_mode='label', class_to_idx=None, input_regex='*', rel_target_root='', target_prefix='', target_postfix='', target_extension='png', transform=None, target_transform=None, co_transform=None, apply_co_transform_first=True, default_loader='pil', target_loader=None, exclusion_file=None, target_index_map=None)[source]¶ Bases:
pywick.datasets.UsefulDataset.UsefulDataset
An incredibly versatile dataset class for loading out-of-memory data.
First, the relevant directory structures are traversed to find all necessary files.
Then provided loader(s) is/(are) invoked on inputs and targets.
Finally provided transforms are applied with optional ability to specify the order of individual and co-transforms.
- The rel_target_root parameter is used for image segmentation cases
Typically the structure will look like the following:
|- root (aka training images)
    - dir1
    - dir2
|- masks (aka label images)
    - dir1
    - dir2
Parameters: - root – (string): path to main directory
- class_mode –
(string in {‘label’, ‘image’, ‘path’}): type of target sample to look for and return
label = return class folder as target
image = return another image as target (determined by optional target_prefix/postfix). NOTE: if class_mode == ‘image’, in addition to input, you must also provide
rel_target_root
, target_prefix
or target_postfix
(in any combination).
path = determines paths for inputs and targets and applies the respective loaders to the path
- class_to_idx – (dict): If specified, the given class_to_idx map will be used. Otherwise one will be derived from the directory structure.
- input_regex – (string (default is any valid image file)): regular expression to find input images. e.g. if all your inputs have the word ‘input’, you’d enter something like input_regex=’input’
- rel_target_root – (string (default is Nothing)): root of directory where to look for target images RELATIVE to the root dir (first arg)
- target_prefix – (string (default is Nothing)): prefix to use (if any) when trying to locate the matching target
- target_postfix – (string): postfix to use (if any) when trying to locate the matching target
- transform – (torch transform): transform to apply to input sample individually
- target_transform – (torch transform): transform to apply to target sample individually
- co_transform – (torch transform): transform to apply to both the input and the target
- apply_co_transform_first – (bool): whether to apply the co-transform before or after individual transforms (default: True = before)
- default_loader – (string in {‘npy’, ‘pil’} or function (default: pil)): defines how to load samples from file. Will be applied to both input and target unless a separate target_loader is defined. if a function is provided, it should take in a file path as input and return the loaded sample.
- target_loader – (string in {‘npy’, ‘pil’} or function (default: pil)): defines how to load target samples from file. If a function is provided, it should take in a file path as input and return the loaded sample.
- exclusion_file – (string): list of files to exclude when enumerating all files. The list must be a full path relative to the root parameter
- target_index_map –
(dict (defaults to binary mask: {255:1})): a dictionary that maps pixel values in the image to classes to be recognized.
Used in conjunction with the ‘image’ class_mode to produce a label for semantic segmentation. This is required for semantic segmentation, so the default is a binary mask. To turn this feature off, specify target_index_map=None
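The role of `target_index_map` can be illustrated with a tiny mask represented as nested lists. This is a sketch, not pywick code: `apply_index_map` is hypothetical, and sending unmapped pixel values to class 0 is an assumption.

```python
def apply_index_map(mask, index_map, default=0):
    """Replace each pixel value with its class index; unmapped values -> default."""
    return [[index_map.get(px, default) for px in row] for row in mask]

mask = [
    [0, 255, 255],
    [0,   0, 255],
]
labels = apply_index_map(mask, {255: 1})
```

With the binary map `{255: 1}`, white mask pixels become class 1 and everything else becomes background.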
-
getdata
()[source]¶ Data that the Dataset class operates on. Typically iterable/list of tuple(label,target). Note: This is different than simply calling myDataset.data because some datasets are comprised of multiple other datasets! The dataset returned should be the combined dataset!
Returns: iterable - Representation of the entire dataset (combined if necessary from multiple other datasets)
MultiFolderDataset¶
-
class
pywick.datasets.MultiFolderDataset.
MultiFolderDataset
(roots, class_mode='label', class_to_idx=None, input_regex='*', rel_target_root='', target_prefix='', target_postfix='', target_extension='png', transform=None, target_transform=None, co_transform=None, apply_co_transform_first=True, default_loader='pil', target_loader=None, exclusion_file=None, target_index_map=None)[source]¶ Bases:
pywick.datasets.FolderDataset.FolderDataset
This class extends the FolderDataset with the ability to supply multiple root directories. The
rel_target_root
must exist relative to each root directory. For a complete description of functionality see FolderDataset
Parameters: - roots – (list): list of root directories to traverse
- class_mode –
(string in {‘label’, ‘image’, ‘path’}): type of target sample to look for and return
label = return class folder as target
image = return another image as target (determined by optional target_prefix/postfix)
NOTE: if class_mode == ‘image’, in addition to input, you must also provide rel_target_root, target_prefix or target_postfix (in any combination).
path = determines paths for inputs and targets and applies the respective loaders to the path
- class_to_idx – (dict): If specified, the given class_to_idx map will be used. Otherwise one will be derived from the directory structure.
- input_regex –
(string (default is any valid image file)): regular expression to find input images
e.g. if all your inputs have the word ‘input’, you’d enter something like input_regex=’input’
- rel_target_root – (string (default is Nothing)): root of directory where to look for target images RELATIVE to the root dir (first arg)
- target_prefix – (string (default is Nothing)): prefix to use (if any) when trying to locate the matching target
- target_postfix – (string): postfix to use (if any) when trying to locate the matching target
- transform – (torch transform): transform to apply to input sample individually
- target_transform – (torch transform): transform to apply to target sample individually
- co_transform – (torch transform): transform to apply to both the input and the target
- apply_co_transform_first – (bool): whether to apply the co-transform before or after individual transforms (default: True = before)
- default_loader –
(string in {‘npy’, ‘pil’} or function (default: pil)): defines how to load samples from file. Will be applied to both input and target unless a separate target_loader is defined.
if a function is provided, it should take in a file path as input and return the loaded sample.
- target_loader –
(string in {‘npy’, ‘pil’} or function (default: pil)): defines how to load target samples from file
if a function is provided, it should take in a file path as input and return the loaded sample.
- exclusion_file – (string): list of files to exclude when enumerating all files. The list must be a full path relative to the root parameter
- target_index_map –
(dict (defaults to binary mask: {255:1})): a dictionary that maps pixel values in the image to classes to be recognized.
Used in conjunction with the ‘image’ class_mode to produce a label for semantic segmentation. This is required for semantic segmentation, so the default is a binary mask. To turn this feature off, specify target_index_map=None
PredictFolderDataset¶
-
class
pywick.datasets.PredictFolderDataset.
PredictFolderDataset
(root, input_regex='*', input_transform=None, input_loader=<function <lambda>>, target_loader=None, exclusion_file=None)[source]¶ Bases:
pywick.datasets.FolderDataset.FolderDataset
Convenience class for loading out-of-memory data that is more geared toward prediction data loading (where ground truth is not available).
If not transformed in any way (either via one of the loaders or transforms), the inputs and targets will be identical (paths to the discovered files).
Instead, the intended use is that the input path is loaded into some kind of binary representation (usually an image), while the target is either left as a path or is post-processed to accommodate some special need.
Parameters: - root – (string): path to main directory
- input_regex – (string (default is any valid image file)): regular expression to find inputs. e.g. if all your inputs have the word ‘input’, you’d enter something like input_regex=’input’
- input_transform – (torch transform): transform to apply to each input before returning
- input_loader – (callable (default: identity)): defines how to load input samples from file. If a function is provided, it should take in a file path as input and return the loaded sample. Identity simply returns the input.
- target_loader – (callable (default: None)): defines how to load target samples from file (which, in our case, are the same as inputs). If a function is provided, it should take in a file path as input and return the loaded sample.
- exclusion_file – (string): list of files to exclude when enumerating all files. The list must be a full path relative to the root parameter
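The input/target pairing described for prediction data can be sketched without touching the file system. `make_samples` is a hypothetical helper and the "loader" below is a stand-in that only pretends to decode a file.

```python
def make_samples(paths, input_loader=lambda p: p, target_loader=None):
    """Pair every path with itself, passing each side through its loader."""
    target_loader = target_loader or (lambda p: p)
    return [(input_loader(p), target_loader(p)) for p in paths]

paths = ["imgs/a.png", "imgs/b.png"]

# Identity loaders: input and target are the same path.
untouched = make_samples(paths)
# A stand-in input loader; the target stays a path for later bookkeeping.
loaded = make_samples(paths, input_loader=lambda p: f"<pixels of {p}>")
```

Keeping the path as the target is convenient at prediction time, since each output can be traced back to the file it came from.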
TensorDataset¶
-
class
pywick.datasets.TensorDataset.
TensorDataset
(inputs, targets=None, input_transform=None, target_transform=None, co_transform=None)[source]¶ Bases:
pywick.datasets.BaseDataset.BaseDataset
Dataset class for loading in-memory data.
Parameters: - inputs – (numpy array)
- targets – (numpy array)
- input_transform – (transform): transform to apply to input sample individually
- target_transform – (transform): transform to apply to target sample individually
- co_transform – (transform): transform to apply to both input and target sample simultaneously
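How the three transform hooks interact can be sketched with a toy in-memory dataset. `InMemoryDataset` is hypothetical and uses plain lists instead of numpy arrays; applying the co-transform after the individual transforms is an assumption, not documented behavior.

```python
class InMemoryDataset:
    """Toy in-memory dataset with per-sample and joint transforms (hypothetical)."""

    def __init__(self, inputs, targets=None, input_transform=None,
                 target_transform=None, co_transform=None):
        self.inputs, self.targets = inputs, targets
        self.input_transform = input_transform
        self.target_transform = target_transform
        self.co_transform = co_transform

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, i):
        x = self.inputs[i]
        if self.input_transform:
            x = self.input_transform(x)
        if self.targets is None:
            return x
        y = self.targets[i]
        if self.target_transform:
            y = self.target_transform(y)
        if self.co_transform:          # assumed to run after the individual ones
            x, y = self.co_transform(x, y)
        return x, y

ds = InMemoryDataset([1, 2, 3], [10, 20, 30],
                     input_transform=lambda x: x * 2,
                     co_transform=lambda x, y: (x, y + x))
```

A co-transform receives both values at once, which is what makes it suitable for operations that must stay in sync, such as applying the same random crop to an image and its mask.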
UsefulDataset¶
-
class
pywick.datasets.UsefulDataset.
UsefulDataset
[source]¶ Bases:
torch.utils.data.Dataset
A
torch.utils.data.Dataset
class with additional useful functions.
-
getdata
()[source]¶ Data that the Dataset class operates on. Typically iterable/list of tuple(label,target). Note: This is different than simply calling myDataset.data because some datasets are comprised of multiple other datasets! The dataset returned should be the combined dataset!
Returns: iterable - Representation of the entire dataset (combined if necessary from multiple other datasets)
data utilities¶
-
pywick.datasets.data_utils.
adjust_dset_length
(dataset, num_batches: int, num_devices: int, batch_size: int)[source]¶ - To properly distribute computation across devices (typically GPUs) we need to meet two criteria:
- batch size on each device must be > 1
- dataset must be evenly partitioned across devices in specified batches
Parameters: - dataset – Dataset to trim
- num_batches – Number of batches that the dataset will be partitioned into
- num_devices – Number of devices the dataset will be distributed onto
- batch_size – Size of individual batch
Returns:
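The arithmetic behind the two criteria can be sketched as follows. This is not the library's implementation: `trimmed_length` is a hypothetical helper that treats `batch_size` as the per-device batch size and ignores `num_batches`, both of which are assumptions.

```python
def trimmed_length(dataset_len, num_devices, batch_size):
    """Largest length <= dataset_len that splits evenly into per-device batches.

    Assumes `batch_size` is the per-device batch size.
    """
    if batch_size <= 1:
        raise ValueError("batch size on each device must be > 1")
    chunk = num_devices * batch_size      # samples consumed per step
    return (dataset_len // chunk) * chunk

new_len = trimmed_length(dataset_len=1003, num_devices=4, batch_size=8)
```

Dropping the few trailing samples avoids a final step in which some device receives a short or empty batch.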
-
pywick.datasets.data_utils.
get_dataset_mean_std
(data_set, img_size=256, output_div=255.0)[source]¶ Computes channel-wise mean and std of the dataset. The process is memory-intensive as the entire dataset must fit into memory. Therefore, each image is scaled down to img_size first (default: 256).
- Assumptions:
- dataset uses PIL to read images
- Images are in RGB format.
Parameters: - data_set – (pytorch Dataset)
- img_size – (int): scale of images at which to compute mean/std (default: 256)
- output_div – (float {1.0, 255.0}): Image values are naturally in 0-255 value range so the returned output is divided by output_div. For example, if output_div = 255.0 then mean/std will be in 0-1 range.
Returns: (mean, std) as per-channel values ([r,g,b], [r,g,b])
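The effect of `output_div` on the returned statistics can be illustrated with a dependency-free sketch. The helper `channel_mean_std` and the two tiny "images" are hypothetical, and using the population standard deviation is an assumption about the calculation.

```python
from statistics import mean, pstdev

# Two tiny "images", each a list of (r, g, b) pixels in the 0-255 range.
images = [
    [(255, 0, 0), (255, 0, 0)],
    [(0, 0, 0), (0, 255, 0)],
]

def channel_mean_std(imgs, output_div=255.0):
    """Pool all pixels and compute per-channel mean/std, divided by output_div."""
    pixels = [px for img in imgs for px in img]
    channels = list(zip(*pixels))         # -> (all r, all g, all b)
    means = [mean(c) / output_div for c in channels]
    stds = [pstdev(c) / output_div for c in channels]
    return means, stds

means, stds = channel_mean_std(images)
```

With `output_div=255.0` the results land in the 0-1 range, which is the scale usually expected by normalization transforms applied to tensors.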
-
pywick.datasets.data_utils.
npy_loader
(path, color_space=None)[source]¶ Convenience loader for numeric files (e.g. arrays of numbers)
-
pywick.datasets.data_utils.
pil_loader
(path, color_space='')[source]¶ Attempts to load a file using PIL with provided
color_space
.
Parameters: - path – (string): file to load
- color_space – (string, one of {rgb, rgba, L, 1, binary}): Specifies the colorspace
to use for PIL loading. If not provided a simple
Image.open(path)
will be performed.
Returns: PIL image