
Data preparation

By default, CAREamics supports data held in memory as NumPy arrays, as well as data stored on disk as TIFF, CZI and Zarr files. Each format comes with its own constraints.

Arrays

Arrays are the simplest and fastest way to train and predict with CAREamics. They can be passed as is to the CAREamist, either as a single array or as a list of arrays.

TIFF

Because TIFF files are widely used, they are the most common use case for CAREamics. TIFF loading is compatible with in-memory training. If the data is too large to fit in memory, CAREamics trains by loading files from disk one at a time and cycling through them to extract patches. While slower, this ensures that training is performed over the entire set of files.

To train on TIFF files, you can either pass a single path to a TIFF, a list of paths to TIFF files, or a path to a directory containing TIFF files. In the latter case, all TIFF files in the directory will be used for training.
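
To check which files a given input would cover before training, the three input forms can be resolved to a flat list. The sketch below is illustrative only: `resolve_tiff_inputs` is a hypothetical helper, not part of CAREamics, and the `.tif`/`.tiff` extension filter is an assumption.

```python
from pathlib import Path
from typing import Sequence, Union

PathLike = Union[str, Path]

# Assumed extensions; adjust them to match your files.
TIFF_SUFFIXES = {".tif", ".tiff"}


def resolve_tiff_inputs(source: Union[PathLike, Sequence[PathLike]]) -> list[Path]:
    """Hypothetical helper: expand a path, a list of paths or a directory."""
    if isinstance(source, (str, Path)):
        path = Path(source)
        if path.is_dir():
            # Directory: keep every TIFF file it contains, in a stable order.
            return sorted(
                p
                for p in path.iterdir()
                if p.is_file() and p.suffix.lower() in TIFF_SUFFIXES
            )
        # Single file: wrap it in a list.
        return [path]
    # List of paths: returned in the order given.
    return [Path(p) for p in source]
```

Printing the result of `resolve_tiff_inputs("path/to/folder")` is a quick way to confirm that no unexpected file ends up in the training set.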

CZI

The CZI format is used by Zeiss microscopes, and has constraints on the axes that should be specified:

  • S, C, Y and X are always present and in this order. Even in the absence of channels, the C axis should be present.
  • If T or Z is specified (axes=SCTYX or axes=SCZYX), then they will be used as depth axis (as Z).
  • T and Z are mutually exclusive.

Using CZI
from careamics.config.factories import create_n2v_config

# create a configuration
config = create_n2v_config(
    experiment_name="n2v_czi",
    data_type="czi",  # (1)!
    axes="SCZYX",  # (2)!
    patch_size=[16, 64, 64],  # (3)!
    batch_size=8,
    num_epochs=30,
    n_channels=1,  # (4)!
)
  1. We set the data type to czi.
  2. Axes must be one of SCYX, SCTYX or SCZYX.
  3. If Z or T is specified, we need to pass a 3D patch_size.
  4. The number of channels must be specified, and can be a singleton if there is a single channel.
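
The axes and patch size rules for CZI can be checked before building a configuration. This is a minimal sketch of such a check; `check_czi_axes` is a hypothetical helper written for illustration, not a CAREamics function, and CAREamics may report these errors differently.

```python
from typing import Sequence

# Axes accepted for CZI data, as listed in the annotations above.
VALID_CZI_AXES = ("SCYX", "SCTYX", "SCZYX")


def check_czi_axes(axes: str, patch_size: Sequence[int]) -> None:
    """Hypothetical pre-flight check for CZI axes and patch size."""
    if axes not in VALID_CZI_AXES:
        raise ValueError(f"axes must be one of {VALID_CZI_AXES}, got {axes!r}")

    # T or Z acts as the depth axis, so the patch needs three dimensions.
    expected_dims = 3 if ("T" in axes or "Z" in axes) else 2
    if len(patch_size) != expected_dims:
        raise ValueError(
            f"axes {axes!r} require a {expected_dims}D patch_size, "
            f"got {len(patch_size)} dimensions"
        )


# Matches the configuration shown above: no exception is raised.
check_czi_axes("SCZYX", [16, 64, 64])
```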

Only paths to CZI files can be used as input to CAREamics. A directory containing multiple CZI files is not accepted as input; the list of files must be passed explicitly.
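
When the CZI files sit in a single folder, the explicit list can be built with `pathlib`. A minimal sketch, where `list_czi_files` is a hypothetical helper and matching on the `.czi` extension is an assumption about how the files are named:

```python
from pathlib import Path
from typing import Union


def list_czi_files(directory: Union[str, Path]) -> list[Path]:
    """Hypothetical helper: collect the CZI files found directly in a folder."""
    return sorted(
        p
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".czi"
    )
```

The returned list can then be used as the training or prediction input in place of the directory path.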

CARE and Noise2Noise

This example also applies to CARE and Noise2Noise, using their respective configuration functions and their own name for the number-of-channels parameter. See the configuration section for more details.

Zarr

Zarr is a chunked format that allows training on very large data without loading it into memory. Because a Zarr file can hold multiple arrays in an arbitrary organization, we defined a flexible way to specify which data should be used.

There are three ways to specify which array(s) should be used for training or prediction:

  • Pointing to a Zarr file (path/to/file.zarr)
  • Pointing to a single Zarr group using a URI (file://path/to/file.zarr/group_name)
  • Pointing to a single Zarr array using a URI (file://path/to/file.zarr/group_name/array_name)

All these options are valid, and several can be bundled together in a list. The only constraint is that the list must contain only URIs or only paths to Zarr files, not a mix of both. Zarr URIs can be constructed by getting a reference to a group or array, then converting group.store_path or array.store_path to a string.

In the following example, we construct a Zarr file with arrays in different hierarchy levels, and showcase various ways to specify which array should be used for training.

Using Zarr
from pathlib import Path
import numpy as np
import zarr

# create a toy example
# - train_data.zarr
#     - array_1: (128, 128)
#     - array_2: (128, 128)
#     - others
#         - array_2: (128, 128)
#         - array_3: (128, 128)
zarr_path = Path("train_data.zarr")
zarr_file = zarr.open(zarr_path, mode="w")
array_root_1 = zarr_file.create_array("array_1", data=np.random.rand(128, 128))
array_root_2 = zarr_file.create_array("array_2", data=np.random.rand(128, 128))

group = zarr_file.create_group("others")
array_2 = group.create_array("array_2", data=np.random.rand(128, 128))
array_3 = group.create_array("array_3", data=np.random.rand(128, 128))

# different ways to specify training data
train_from_zarr = zarr_path  # (1)!
train_from_group = str(group.store_path)  # (2)!
train_from_array = str(array_root_1.store_path)  # (3)!
train_from_list = [  # (4)!
    str(array_root_1.store_path),
    str(array_root_2.store_path),
    str(array_2.store_path),
    str(array_3.store_path),
]
  1. Only the arrays at the root of the file, i.e. array_1 and array_2, will be loaded.
  2. Only arrays in others, i.e. array_2 and array_3, will be loaded.
  3. Only array_1 will be loaded.
  4. All arrays will be loaded, as they are all specified in the list.
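
Because a list may not mix URIs and file paths, it can help to check a list before passing it on. The following is a sketch under the assumption that URIs are strings starting with `file://`, as in the examples above; `check_zarr_sources` is a hypothetical helper, not a CAREamics function.

```python
from pathlib import Path
from typing import Sequence, Union


def is_zarr_uri(source: Union[str, Path]) -> bool:
    """Assumption: a URI is a string that starts with 'file://'."""
    return isinstance(source, str) and source.startswith("file://")


def check_zarr_sources(sources: Sequence[Union[str, Path]]) -> str:
    """Return 'uri' or 'path', raising if the list is empty or mixes both."""
    kinds = {"uri" if is_zarr_uri(s) else "path" for s in sources}
    if len(kinds) != 1:
        raise ValueError(
            "Expected a non-empty list holding only URIs or only Zarr file paths."
        )
    return kinds.pop()
```

For instance, `check_zarr_sources(train_from_list)` would return `"uri"` for the list built above, while a list combining `zarr_path` with `str(group.store_path)` would raise.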

OME-Zarr

Currently, CAREamics does not check whether a file is an OME-Zarr. As a result, simply passing a path to an OME-Zarr file will fail, since CAREamics expects arrays at the root of the file.

Therefore, to use an OME-Zarr file, you need to specify the URI to the array you want to train on.

In the near future, we will add full OME-Zarr support.

Multiscales OME-Zarr and Noise2Void

Noise2Void is very sensitive to the noise distribution in the data. If an image has been downscaled, correlations may have been introduced in the noise, causing Noise2Void to perform poorly. We advise training on the raw, unprocessed data if available.
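
For a multiscale OME-Zarr image, the OME-Zarr specification orders the datasets from highest to lowest resolution, so the first entry is the level that has not been downscaled. The sketch below picks that entry from the image group's attributes and appends it to the group URI. Both functions are hypothetical helpers, the attributes are assumed to be available as a plain dictionary (for example copied from the group's attributes), and the `file://` prefix simply follows the URI examples given earlier on this page.

```python
from typing import Any, Mapping


def full_resolution_path(attrs: Mapping[str, Any]) -> str:
    """Return the path of the highest-resolution level of a multiscale image."""
    # OME-Zarr 0.5 nests its metadata under "ome"; 0.4 keeps it at the top level.
    meta = attrs.get("ome", attrs)
    datasets = meta["multiscales"][0]["datasets"]
    # The specification lists datasets from highest to lowest resolution.
    return datasets[0]["path"]


def array_uri(group_uri: str, array_path: str) -> str:
    """Append an array path to the URI of the group that contains it."""
    return f"{group_uri.rstrip('/')}/{array_path}"


# Toy metadata with two resolution levels stored as arrays "0" and "1".
example_attrs = {"multiscales": [{"datasets": [{"path": "0"}, {"path": "1"}]}]}
uri = array_uri("file://path/to/image.zarr", full_resolution_path(example_attrs))
```

Whether the first level holds the raw acquisition still depends on how the file was written, so it is worth confirming with whoever produced the data.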

Custom data formats

CAREamics allows reading formats not natively supported using two mechanisms:

  • Simple loading using a Python function. All files with the expected file extension will be loaded in memory.
  • Advanced loading using a custom ImageStack implementation, useful for chunked or memory-mapped file formats.

Custom Read Function

This uses the same mechanism as training on in-memory TIFF files. A simple function that reads a path and returns a NumPy array can be provided when training or predicting, then the inputs can be specified using:

  • a path to a file,
  • a list of paths, or
  • a path to a directory.

See the Custom Read Function Tutorial for an example on using a custom read function.
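
As an illustration of the shape such a function takes, here is a minimal sketch that reads NumPy `.npy` files. The name `read_npy` and the choice to accept and ignore extra arguments are assumptions; how the function and its file extension are handed to CAREamics is covered in the tutorial.

```python
from pathlib import Path
from typing import Any, Union

import numpy as np


def read_npy(file_path: Union[str, Path], *args: Any, **kwargs: Any) -> np.ndarray:
    """Read a single .npy file and return its content as a NumPy array.

    Extra positional and keyword arguments are accepted and ignored
    (an assumption, so that callers can pass additional options).
    """
    return np.load(Path(file_path))
```

The array returned by the function should have the axes declared in the configuration.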

Custom Image Stack Loader

Training and predicting on a custom memory-mapped or chunked file format is more complex, but it enables training without loading an entire image file into memory at once. It involves implementing an ImageStack class and an ImageStackLoader function to load the image stacks. The custom loading function can be implemented to accept any input type, which allows the same input type to be passed to training and prediction.

See the Custom Image Stack & Loader Tutorial for an example on using a custom image stack and loader.
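
To give a rough idea of what such an implementation involves, the sketch below wraps a memory-mapped `.npy` file so that only the requested patch is read into memory. It is not the CAREamics interface: the class, the property and method names (`source`, `data_shape`, `data_dtype`, `extract_patch`), the loader signature and the assumed (S, C, Y, X) axis layout are all illustrative assumptions, and the tutorial describes what CAREamics actually expects.

```python
from pathlib import Path
from typing import Sequence, Union

import numpy as np


class NpyImageStack:
    """Hypothetical image stack backed by a memory-mapped .npy file.

    Assumes the stored array has axes (S, C, Y, X).
    """

    def __init__(self, path: Union[str, Path]) -> None:
        self._path = Path(path)
        # mmap_mode="r" maps the file instead of reading it into memory.
        self._data = np.load(self._path, mmap_mode="r")

    @property
    def source(self) -> Path:
        return self._path

    @property
    def data_shape(self) -> tuple:
        return self._data.shape

    @property
    def data_dtype(self) -> np.dtype:
        return self._data.dtype

    def extract_patch(
        self, sample_idx: int, coords: Sequence[int], patch_size: Sequence[int]
    ) -> np.ndarray:
        """Return a patch of shape (C, *patch_size) from one sample."""
        # One slice per spatial axis, starting at the given coordinates.
        spatial = tuple(slice(c, c + p) for c, p in zip(coords, patch_size))
        # np.array copies only the selected region out of the mapped file.
        return np.array(self._data[(sample_idx, slice(None), *spatial)])


def load_npy_stacks(sources: Sequence[Union[str, Path]]) -> list[NpyImageStack]:
    """Hypothetical loader: build one image stack per input file."""
    return [NpyImageStack(source) for source in sources]
```

Keeping the file access inside `extract_patch` is what allows the rest of the pipeline to request patches without ever holding a full image in memory.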