Data preparation
By default, CAREamics supports data stored in memory as NumPy arrays, as well as data stored on disk as TIFF, CZI and Zarr files. Each format comes with its own constraints.
Arrays
Arrays are the simplest and fastest way to train and predict with CAREamics. They can be
passed as is to the CAREamist, either as a single array or as a list of arrays.
TIFF
As TIFF files are widely used, they are the most common use case of CAREamics. TIFF loading is compatible with in-memory training. If the data is too large to fit in memory, CAREamics trains by loading files from disk one at a time and cycling through them to extract patches. While slower, this ensures that training is performed over the entire set of files.
To train on TIFF files, you can either pass a single path to a TIFF, a list of paths to TIFF files, or a path to a directory containing TIFF files. In the latter case, all TIFF files in the directory will be used for training.
CZI
The CZI format is used by Zeiss microscopes, and has constraints on the axes that should
be specified:
- S, C, Y and X are always present and in this order. Even in the absence of
channels, the C axis should be present.
- If T or Z is specified (axes=SCTYX or axes=SCZYX), then it will be used
as the depth axis (as Z).
- T and Z are mutually exclusive.
from careamics.config.factories import create_n2v_config

# create a configuration
config = create_n2v_config(
    experiment_name="n2v_czi",
    data_type="czi",  # (1)!
    axes="SCZYX",  # (2)!
    patch_size=[16, 64, 64],  # (3)!
    batch_size=8,
    num_epochs=30,
    n_channels=1,  # (4)!
)
- (1) We set the data type to czi.
- (2) Axes must be one of SCYX, SCTYX or SCZYX.
- (3) If Z or T is specified, we need to pass a 3D patch_size.
- (4) The number of channels must be specified, and can be a singleton if there is a single channel.
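The axes rules above can be condensed into a small pre-flight check. The following is a minimal sketch and not part of the CAREamics API: `check_czi_axes` is a hypothetical helper that only encodes the constraints listed in this section, and it assumes that a 2D patch_size is expected when neither T nor Z is present.

```python
from collections.abc import Sequence

# Axes strings allowed for CZI data according to the rules in this section.
ALLOWED_CZI_AXES = ("SCYX", "SCTYX", "SCZYX")


def check_czi_axes(axes: str, patch_size: Sequence[int]) -> None:
    """Hypothetical helper: raise ValueError if axes/patch_size break the CZI rules."""
    if axes not in ALLOWED_CZI_AXES:
        raise ValueError(
            f"Invalid CZI axes {axes!r}, expected one of {ALLOWED_CZI_AXES}."
        )

    # T or Z acts as the depth axis, so the patch needs a depth dimension too.
    # Assumption of this sketch: without T or Z, a 2D patch is expected.
    expected_dims = 3 if ("T" in axes or "Z" in axes) else 2
    if len(patch_size) != expected_dims:
        raise ValueError(
            f"Axes {axes!r} require a {expected_dims}D patch_size, "
            f"got {list(patch_size)}."
        )


check_czi_axes("SCZYX", [16, 64, 64])  # valid: 3D patch with a depth axis
check_czi_axes("SCYX", [64, 64])  # valid: 2D patch, no depth axis
```

Running such a check before creating the configuration can surface an axes mistake early, for instance an axes string containing both T and Z.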
Only paths to CZI files can be used as input to CAREamics. A directory containing multiple CZI files is not accepted as input; the list of files must be passed explicitly.
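Since a directory cannot be passed directly, the explicit list of files can be built with the standard library. This is a minimal sketch: `list_czi_files` is a hypothetical helper, the `.czi` extension match is an assumption, and `path/to/czi_dir` is a placeholder to replace with your own folder.

```python
from pathlib import Path


def list_czi_files(directory: Path) -> list[Path]:
    """Hypothetical helper: return the CZI files in `directory`, sorted by name."""
    # Match the extension case-insensitively (assumption: files end in .czi or .CZI).
    return sorted(p for p in directory.iterdir() if p.suffix.lower() == ".czi")


# Placeholder folder; the list stays empty if the folder does not exist.
data_dir = Path("path/to/czi_dir")
train_files = list_czi_files(data_dir) if data_dir.is_dir() else []
```

The resulting `train_files` list is what would be passed as the training input in place of the directory.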
CARE and Noise2Noise
This example also applies to CARE and Noise2Noise, using their respective configuration functions and their own parameter names for the number of channels. See the configuration section for more details.
Zarr
Zarr is a chunked format that allows training on very large data without loading it into memory. Because a Zarr file can hold multiple arrays and can have an arbitrary organization, we defined a flexible way to specify which data should be used.
There are three ways to specify which array(s) should be used for training or prediction:
- Pointing to a Zarr file (path/to/file.zarr)
- Pointing to a single Zarr group using a URI (file://path/to/file.zarr/group_name)
- Pointing to a single Zarr array using a URI (file://path/to/file.zarr/group_name/array_name)
All these options are valid, and several can be bundled together in a list. The only constraint
is that the list must contain either only URIs or only paths to Zarr files, not a mix of both. Zarr
URIs can be constructed by getting a reference to a group or array and reading its
group.store_path or array.store_path attribute.
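The "only URIs or only paths" constraint can be verified before handing a list to CAREamics. The sketch below is illustrative only: `is_zarr_uri` and `check_zarr_sources` are hypothetical helpers, and treating any string that starts with `file://` as a URI is an assumption based on the URI examples in this section.

```python
from collections.abc import Sequence
from pathlib import Path
from typing import Union

ZarrSource = Union[str, Path]


def is_zarr_uri(source: ZarrSource) -> bool:
    """Treat strings starting with 'file://' as Zarr URIs (assumption of this sketch)."""
    return isinstance(source, str) and source.startswith("file://")


def check_zarr_sources(sources: Sequence[ZarrSource]) -> str:
    """Hypothetical helper: return 'uri' or 'path', raise ValueError on a mix."""
    kinds = {"uri" if is_zarr_uri(source) else "path" for source in sources}
    if len(kinds) != 1:
        raise ValueError(
            "The list must be non-empty and contain only URIs or only paths "
            "to Zarr files."
        )
    return kinds.pop()


# Only URIs: accepted.
check_zarr_sources(
    [
        "file://path/to/file.zarr/group_name",
        "file://path/to/file.zarr/group_name/array_name",
    ]
)
# Only paths to Zarr files: accepted.
check_zarr_sources([Path("a.zarr"), Path("b.zarr")])
```

A list such as `[Path("a.zarr"), "file://path/to/file.zarr/group_name"]` would raise a `ValueError` in this sketch, mirroring the constraint described above.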
In the following example, we construct a Zarr file with arrays in different hierarchy levels, and showcase various ways to specify which array should be used for training.
from pathlib import Path

import numpy as np
import zarr

# create a toy example
# - train_data.zarr
#     - array_1: (128, 128)
#     - array_2: (128, 128)
#     - others
#         - array_2: (128, 128)
#         - array_3: (128, 128)
zarr_path = Path("train_data.zarr")
zarr_file = zarr.open(zarr_path, mode="w")

# arrays stored at the root of the Zarr file
array_root_1 = zarr_file.create_array("array_1", data=np.random.rand(128, 128))
array_root_2 = zarr_file.create_array("array_2", data=np.random.rand(128, 128))

# arrays stored in the "others" group
group = zarr_file.create_group("others")
array_2 = group.create_array("array_2", data=np.random.rand(128, 128))
array_3 = group.create_array("array_3", data=np.random.rand(128, 128))

# different ways to specify training data
train_from_zarr = zarr_path  # (1)!
train_from_group = str(group.store_path)  # (2)!
train_from_array = str(array_root_1.store_path)  # (3)!
train_from_list = [  # (4)!
    str(array_root_1.store_path),
    str(array_root_2.store_path),
    str(array_2.store_path),
    str(array_3.store_path),
]
- (1) Only array_1 and array_2 will be loaded.
- (2) Only arrays in others, i.e. array_2 and array_3, will be loaded.
- (3) Only array_1 will be loaded.
- (4) All arrays will be loaded, as they are all specified in the list.
OME-Zarr
Currently, we are ignoring whether a file is an OME-Zarr or not. As a result, simply passing a path to the Zarr file will fail, since CAREamics will expect arrays in the root of the file.
Therefore, to use an OME-Zarr file, you need to specify the URI to the array you want to train on.
In the near future, we will add full OME-Zarr support.
Multiscales OME-Zarr and Noise2Void
Noise2Void is very sensitive to the noise distribution in the data. If an image has been downscaled, correlations may have been introduced in the noise, causing Noise2Void to perform poorly. We advise training on the raw, unprocessed data if available.
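Following the two notes above, an OME-Zarr file can be used by pointing to one of its arrays with a URI. The sketch below builds such a URI by hand, following the file://path/to/file.zarr/group_name/array_name pattern shown earlier. `zarr_array_uri` is a hypothetical helper, `path/to/image.ome.zarr` is a placeholder, and the array name `0` for the full-resolution level is an assumption that should be checked against the multiscales metadata of your own file.

```python
from pathlib import Path


def zarr_array_uri(zarr_path: Path, array_path: str) -> str:
    """Hypothetical helper: build a 'file://' URI to an array in a local Zarr file."""
    return f"file://{zarr_path.as_posix()}/{array_path.strip('/')}"


# Placeholder file name; "0" is assumed to be the full-resolution array.
train_source = zarr_array_uri(Path("path/to/image.ome.zarr"), "0")
print(train_source)  # file://path/to/image.ome.zarr/0
```

When the file is already opened with zarr, reading `array.store_path` as described earlier in this section avoids building the string manually.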
Custom data formats
CAREamics allows reading formats not natively supported using two mechanisms:
- Simple loading using a Python function. All files with the expected file extension will be loaded in memory.
- Advanced loading using a custom ImageStack implementation, useful for chunked or memory-mapped file formats.
Custom Read Function
This uses the same mechanism as training on in-memory TIFF files. A simple function that reads a path and returns a NumPy array can be provided when training or predicting. The inputs can then be specified using:
- a path to a file,
- a list of paths, or
- a path to a directory.
See the Custom Read Function Tutorial for an example of using a custom read function.
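As an illustration of "a function that reads a path and returns a NumPy array", here is a minimal sketch for `.npy` files. The name `read_npy` and the choice of file format are assumptions made for this example, and accepting extra arguments is a precaution rather than a documented requirement. How the function is handed to CAREamics is covered in the tutorial and is not shown here.

```python
from pathlib import Path
from typing import Any

import numpy as np


def read_npy(file_path: Path, *args: Any, **kwargs: Any) -> np.ndarray:
    """Read a single .npy file and return its content as a NumPy array."""
    # Extra positional/keyword arguments are accepted and ignored (assumption:
    # the caller may pass additional information that this reader does not need).
    return np.load(file_path)
```

Any format that can be converted to a NumPy array in a similar few lines is a good candidate for this mechanism, as long as each file fits in memory.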
Custom Image Stack Loader
Training and predicting on a custom memory-mapped or chunked file format is more complex, but it enables training without loading an entire image file into memory at once. It involves implementing an ImageStack class and an ImageStackLoader function to load the image stacks. The custom loading function can be implemented to accept any input type, so that the same input type can be passed to both training and prediction.
See the Custom Image Stack & Loader Tutorial for an example of using a custom image stack and loader.