## (Intermediate) Datasets
Datasets are the internal classes providing the individual patches for training, validation and prediction. In CAREamics, we provide a `TrainDataModule` class that creates the datasets for training and validation (there is a class for prediction as well, which is simpler and shares some parameters with the training one). In most cases, it is created internally. In this section, we describe what it does and shed light on some of its parameters that are passed to the train methods.
**Datasets in practice**

This section describes the internal workings of CAREamics. In practice, most users will never have to instantiate the datasets themselves, as they are created from within the `careamist.train` or `careamist.predict` methods.
### Overview
The `TrainDataModule` receives both the data configuration and the data itself. The data can be passed as a path to a folder, as a path to a file, or as a `numpy` array.
```python
import numpy as np

from careamics.config import create_n2v_configuration
from careamics.lightning import TrainDataModule

train_array = np.random.rand(128, 128)

config = create_n2v_configuration(
    experiment_name="n2v_2D",
    data_type="array",
    axes="YX",
    patch_size=[64, 64],
    batch_size=1,
    num_epochs=1,
)

data_module = TrainDataModule(
    data_config=config.data_config,
    train_data=train_array,
)
```
It has the following parameters:
- `data_config`: data configuration
- `train_data`: training data (array or path)
- (optional) `val_data`: validation data; if not provided, the validation data is taken from the training data
- (optional) `train_data_target`: target data for training (if applicable)
- (optional) `val_data_target`: target data for validation (if applicable)
- (optional) `read_source_func`: function to read custom data types (see custom data types)
- (optional) `extension_filter`: filter to select custom types (see custom data types)
- (optional) `val_percentage`: percentage of validation data to extract from the training data (see splitting validation)
- (optional) `val_minimum_split`: minimum validation split (see splitting validation)
- (optional) `use_in_memory`: whether to use the in-memory dataset if possible (default is `True`); not applicable to numpy arrays
Depending on the type of the data, which is specified in the `data_config` and compared to the type of `train_data`, the `TrainDataModule` creates the appropriate dataset for both training and validation data. In the absence of validation data, it is extracted from the training data (see splitting validation).
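For instance, if you already hold out a validation set, you can pass it explicitly via `val_data`, in which case no split is extracted from the training data. A minimal sketch, reusing `config` and `train_array` from the example above (`val_array` is a hypothetical held-out array):

```python
import numpy as np

from careamics.lightning import TrainDataModule

val_array = np.random.rand(128, 128)  # hypothetical held-out validation data

# with an explicit val_data, no validation split is taken from the training data
data_module = TrainDataModule(
    data_config=config.data_config,
    train_data=train_array,
    val_data=val_array,
)
```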
### Available datasets
CAREamics currently supports two datasets:

- `InMemoryDataset`: used when the data fits in memory.
- `IterableDataset`: used when the data is too large to fit in memory.
If the data is a `numpy` array, the `InMemoryDataset` is used automatically. Otherwise, CAREamics lists the files contained in the path, computes the total size of the data, and instantiates an `InMemoryDataset` if the data amounts to less than 80% of the total RAM. If not, CAREamics instantiates an `IterableDataset`.
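To make the heuristic concrete, here is an illustrative sketch of such a check. It is not CAREamics' actual implementation: the `fits_in_memory` helper and the use of `psutil` for the RAM query are assumptions made for this example.

```python
from pathlib import Path

import psutil


def fits_in_memory(data_path: Path, extension_filter: str = "*.tif*") -> bool:
    """Illustrative heuristic: compare the total file size to 80% of the total RAM."""
    files = list(Path(data_path).rglob(extension_filter))
    total_bytes = sum(f.stat().st_size for f in files)
    return total_bytes < 0.8 * psutil.virtual_memory().total
```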
Both datasets work differently, and the main differences can be summarized as follows:
| Feature          | InMemoryDataset | IterableDataset    |
|------------------|-----------------|--------------------|
| Used with arrays | Yes             | No                 |
| Patch extraction | Sequential      | Random             |
| Data loading     | All in memory   | One file at a time |
In the next sections, we describe the different steps they perform.
#### In-memory dataset
As the name implies, the in-memory dataset loads all the data into memory. It is used when the data on disk appears to fit in memory, or when the data is already in memory and passed as a numpy array. The advantage of this dataset is that it allows faster access to the patches, and therefore faster training.
**What about supervised training?**

For supervised training, the steps are the same and are performed for the targets alongside the source.
**What if I have a time (`T`) axis?**

`T` axes are accepted by the CAREamics configuration, but are treated as a sample dimension (`S`). If both `S` and `T` are present, the two axes are concatenated.
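As a shape-level illustration (plain `numpy`, not CAREamics internals), an array with axes `STYX` effectively contributes `S * T` samples:

```python
import numpy as np

# hypothetical stack with S=3 samples and T=10 time points, axes "STYX"
stack = np.random.rand(3, 10, 64, 64)

# conceptually, S and T are concatenated into a single sample dimension
as_samples = stack.reshape(-1, 64, 64)
print(as_samples.shape)  # (30, 64, 64)
```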
#### Iterable dataset
The iterable dataset loads patches from a single file at a time, one file after another. This allows training on datasets that are too large to fit in memory. This dataset is used exclusively with file input (data passed as paths).
**Iterable dataset and splitting validation**

The iterable dataset does not split off patches from the training data, but whole files! (see splitting validation)
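As a sketch, and assuming that disabling `use_in_memory` makes CAREamics fall back to the iterable dataset even when the files would fit in RAM ("data/large_train" is a hypothetical folder of TIFF files):

```python
from careamics.config import create_n2v_configuration
from careamics.lightning import TrainDataModule

config = create_n2v_configuration(
    experiment_name="n2v_2D_large",
    data_type="tiff",
    axes="YX",
    patch_size=[64, 64],
    batch_size=1,
    num_epochs=1,
)

# "data/large_train" is a hypothetical folder that does not fit in memory
data_module = TrainDataModule(
    data_config=config.data_config,
    train_data="data/large_train",
    use_in_memory=False,  # load one file at a time instead of everything upfront
)
```

With the default `use_in_memory=True`, the in-memory dataset would be preferred whenever the files fit in RAM, as described above.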
**What about supervised training?**

For supervised training, the steps are the same and are performed for the targets alongside the source.
**What if I have a time (`T`) axis?**

`T` axes are accepted by the CAREamics configuration, but are treated as a sample dimension (`S`). If both `S` and `T` are present, the two axes are concatenated.
## (Intermediate) Transforms
Transforms are augmentations and other operations applied to the patches before they are fed to the network. CAREamics supports the following transforms (see the configuration full spec for an example of how to configure them):
| Transform             | Description                                 | Notes                                       |
|-----------------------|---------------------------------------------|---------------------------------------------|
| Normalize             | Normalize (zero mean, unit variance)        | Necessary                                   |
| XYFlip                | Flip the image along X and Y, one at a time | Can flip a single axis, optional            |
| XYRandomRotate90Model | Rotate the XY axes by 90 degrees            | Optional                                    |
| N2VManipulateModel    | N2V pixel manipulation                      | Only for N2V, in which case it is necessary |
The `Normalize` transform is always applied, while the rest are optional. The exception is `N2VManipulateModel`, which is only applied when training with N2V (see Noise2Void).
**When to turn off transforms?**

The configuration allows turning off the transforms. In that case, only normalization (and, for N2V, the `N2VManipulateModel`) is applied. This is useful when the structures in your samples always have the same orientation, so that flipping and rotation do not make sense.
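The exact way to disable them depends on your CAREamics version; the sketch below assumes that the configuration convenience functions accept an `augmentations` argument and that passing an empty list removes the optional spatial transforms (flips and rotations), while normalization and the N2V pixel manipulation remain. Check the configuration full spec for the field names in your version.

```python
from careamics.config import create_n2v_configuration

# assumption: an `augmentations` argument controls the optional spatial transforms;
# an empty list turns off flips and rotations, while normalization and the N2V
# pixel manipulation are still applied
config = create_n2v_configuration(
    experiment_name="n2v_2D_no_aug",
    data_type="array",
    axes="YX",
    patch_size=[64, 64],
    batch_size=1,
    num_epochs=1,
    augmentations=[],
)
```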
## (Advanced) Custom data types
To read custom data types, you can set `data_type` to `custom` in the `data_config` and provide, as the `read_source_func` parameter, a function that returns a numpy array from a path. The function will receive a `Path` object and an axes string as arguments, the axes being derived from the `data_config`.
You should also provide an `fnmatch`- and `Path.rglob`-compatible expression (e.g. `"*.npy"`) to filter the file extensions using `extension_filter`.
```python
from pathlib import Path
from typing import Any

import numpy as np

from careamics.config import create_n2v_configuration
from careamics.lightning import TrainDataModule


def read_npy(  # (1)!
    path: Path,  # (2)!
    *args: Any,
    **kwargs: Any,  # (3)!
) -> np.ndarray:
    return np.load(path)  # (4)!


# example data
train_array = np.random.rand(128, 128)
np.save("train_array.npy", train_array)

# configuration
config = create_n2v_configuration(
    experiment_name="n2v_2D",
    data_type="custom",  # (5)!
    axes="YX",
    patch_size=[32, 32],
    batch_size=1,
    num_epochs=1,
)

data_module = TrainDataModule(
    data_config=config.data_config,
    train_data="train_array.npy",  # (6)!
    read_source_func=read_npy,  # (7)!
    extension_filter="*.npy",  # (8)!
)
data_module.prepare_data()
data_module.setup()  # (9)!

# check dataset output
dataloader = data_module.train_dataloader()
print(dataloader.dataset[0][0].shape)  # (10)!
```
1. We define a function that reads the custom data type.
2. It takes a path as argument!
3. But it also needs to receive `*args` and `**kwargs` to be compatible with the `read_source_func` signature.
4. It simply returns a `numpy` array.
5. The data type must be `custom`!
6. And we pass a `Path | str`.
7. Simply pass the function by name.
8. We also need to provide an extension filter that is compatible with `fnmatch` and `Path.rglob`.
9. These two lines are necessary to instantiate the training dataset that we call at the end. They are called automatically by PyTorch Lightning during training.
10. The dataloader gives access to the dataset; we pick the first element, and since we configured CAREamics to use N2V, the output is a tuple whose first element is our first patch!
In practice, you should not access the dataloader directly (except for testing). Using custom types for training should be done as follows:
```python
from pathlib import Path
from typing import Any

import numpy as np

from careamics import CAREamist
from careamics.config import create_n2v_configuration
from careamics.lightning import TrainDataModule


def read_npy(
    path: Path,
    *args: Any,
    **kwargs: Any,
) -> np.ndarray:
    return np.load(path)


# example data
train_array = np.random.rand(128, 128)
np.save("train_array.npy", train_array)

# configuration
config = create_n2v_configuration(
    experiment_name="n2v_2D",
    data_type="custom",
    axes="YX",
    patch_size=[32, 32],
    batch_size=1,
    num_epochs=1,
)

# data module for custom types
data_module = TrainDataModule(
    data_config=config.data_config,
    train_data="train_array.npy",
    read_source_func=read_npy,
    extension_filter="*.npy",
)

# CAREamist
careamist = CAREamist(source=config)

# train
careamist.train(datamodule=data_module)
```
## Prediction datasets
The prediction data module, `PredictDataModule`, works similarly to `TrainDataModule`, albeit with different parameters:

- `pred_config`: inference configuration
- `pred_data`: prediction data (array or path)
- (optional) `read_source_func`: function to read custom data types (see custom data types)
- (optional) `extension_filter`: filter to select custom types (see custom data types)
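In most workflows the `PredictDataModule` is created internally: you simply call `careamist.predict` on your data and CAREamics derives the inference settings from the training configuration. A minimal sketch, reusing the trained `careamist` from the example above:

```python
import numpy as np

# hypothetical unseen data to denoise
test_array = np.random.rand(128, 128)

# the PredictDataModule is created internally from the inference configuration
prediction = careamist.predict(source=test_array)
```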
## (Advanced) Subclass TrainDataModule
The data modules used in CAREamics have only a limited number of parameters, and they make use of the CAREamics datasets. If you need a different dataset, you can subclass `TrainDataModule` and override the `setup` method to use your own datasets.
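Below is a minimal sketch of what such a subclass could look like. Everything in it is illustrative: `MyPatchDataset` is a hypothetical plain PyTorch dataset, and the attribute names and overridden dataloader methods are assumptions based on the standard Lightning hooks, not the exact layout expected by CAREamics, so check the `TrainDataModule` implementation before building on this.

```python
import numpy as np
from torch.utils.data import DataLoader, Dataset

from careamics.lightning import TrainDataModule


class MyPatchDataset(Dataset):
    """Hypothetical dataset returning pre-extracted patches."""

    def __init__(self, patches: np.ndarray) -> None:
        self.patches = patches

    def __len__(self) -> int:
        return len(self.patches)

    def __getitem__(self, idx: int) -> np.ndarray:
        return self.patches[idx]


class MyTrainDataModule(TrainDataModule):
    """Sketch: override `setup` (and the dataloaders) to use the custom dataset."""

    def setup(self, stage=None) -> None:
        # replace CAREamics' dataset creation with your own logic;
        # random arrays stand in for real, pre-extracted patches
        self.custom_train_dataset = MyPatchDataset(np.random.rand(100, 64, 64))
        self.custom_val_dataset = MyPatchDataset(np.random.rand(10, 64, 64))

    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.custom_train_dataset, batch_size=1)

    def val_dataloader(self) -> DataLoader:
        return DataLoader(self.custom_val_dataset, batch_size=1)
```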