Custom data formats
As mentioned in the Data Preparation Guide, CAREamics provides two mechanisms for training and predicting on custom data types:
- a "read function", which reads all the data into memory; and
- an "image stack loader", which is more advanced but supports chunked or memory-mapped file formats.
Custom Read Function
Any function that loads image data from a path and outputs a numpy array can be used.
This example will show how data saved in the .npy format can be loaded for training and prediction.
First, we will save some toy data and create a CAREamics configuration object.
from pathlib import Path
import numpy as np
from careamics import CAREamist, ReadFuncLoading
from careamics.config import create_n2v_config
# --- create toy data
DATA_PATH = Path("data")
DATA_PATH.mkdir(exist_ok=True)
n_files = 5
image_shape = (512, 512)
file_paths: list[Path] = []
for i in range(n_files):
    image = np.random.rand(*image_shape)
    file_path = DATA_PATH / f"image_{i}.npy"
    np.save(file_path, image)
    file_paths.append(file_path)
# 2D image files in a directory, each with the shape (512, 512)
# data/
# ├── image_0.npy
# ├── image_1.npy
# ├── image_2.npy
# ├── image_3.npy
# └── image_4.npy
# --- configuration
config = create_n2v_config(
    "loading-custom",
    data_type="custom",  # (1)!
    axes="YX",  # (2)!
    patch_size=(64, 64),
    batch_size=16,
    num_epochs=10,
)
- The data_type must be set to "custom".
- The axes of each file.
To train and predict on the data we need to define a function to read the data that matches the protocol described by [ReadFunc][careamics.file_io.ReadFunc]. That is, the first argument MUST be named file_path and the return type must be a numpy array. Here we just make a simple wrapper around numpy.load to have the correct function signature.
For training with CAREamist, we pass our custom loading function to the loading argument of CAREamist.train; it must be wrapped in the ReadFuncLoading dataclass.
# Wrapping numpy.load
# The call signature should match the protocol `ReadFunc`
def read_numpy(file_path: Path) -> np.ndarray:
    return np.load(file_path)
careamist = CAREamist(config)
careamist.train(
    train_data=DATA_PATH,  # (1)!
    loading=ReadFuncLoading(  # (2)!
        read_numpy,  # (3)!
        extension_filter="*.npy",  # (4)!
    ),
)
- The input works the same as for TIFF: it can be a single file, a list of files or a directory. Here we demonstrate passing a directory.
- The arguments for custom loading are wrapped in the ReadFuncLoading dataclass.
- Our custom read function.
- An extension filter, which uses glob-style pattern matching. This allows us to pass a directory as input.
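The same pattern should carry over to other file formats. As a minimal sketch, assuming images are stored in .npz archives under an entry called "image" (both the function name read_npz and the key are hypothetical choices for this illustration, not part of CAREamics), a read function could look like the following. The key parameter has a default because this guide does not describe whether extra arguments are forwarded to the read function.

```python
import tempfile
from pathlib import Path

import numpy as np


def read_npz(file_path: Path, key: str = "image") -> np.ndarray:
    """Load a single array from a .npz archive.

    The first argument is named `file_path`, as the guide requires.
    `key` is a hypothetical archive entry name used only in this sketch.
    """
    with np.load(file_path) as archive:
        # Indexing the archive reads the array into memory,
        # so it stays valid after the archive is closed.
        return np.asarray(archive[key])


# Quick check on toy data written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "image_0.npz"
    np.savez(toy_path, image=np.zeros((512, 512)))
    loaded = read_npz(toy_path)

print(loaded.shape)  # (512, 512)
```

Such a function would presumably be paired with a matching extension filter such as "*.npz" when passing a directory.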
Prediction works very similarly to training. CAREamist.predict also returns the source of each prediction, which we can verify are the paths of our data.
predictions, sources = careamist.predict(  # (1)!
    pred_data=DATA_PATH,
    loading=ReadFuncLoading(
        read_numpy,
        extension_filter="*.npy",
    ),
)
# inspect the sources of the predictions
sources
- The same arguments can also be passed to CAREamist.predict_to_disk.
['data/image_0.npy',
'data/image_1.npy',
'data/image_2.npy',
'data/image_3.npy',
'data/image_4.npy']
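The extension_filter used in the calls above is described as a glob-style pattern. How CAREamics applies the filter internally is not shown in this guide, so the following is only an illustration of glob-style matching using the standard-library fnmatch module on a made-up list of file names.

```python
from fnmatch import fnmatch

# Hypothetical directory listing, for illustration only
file_names = ["image_0.npy", "image_1.npy", "notes.txt", "image_2.tiff"]

# "*" matches any run of characters, so "*.npy" keeps names ending in ".npy"
matched = [name for name in file_names if fnmatch(name, "*.npy")]

print(matched)  # ['image_0.npy', 'image_1.npy']
```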
Custom Image Stack & Loader
Training and predicting on a custom memory-mapped or chunked file format is more complex, but it enables training without loading an entire image file into memory at once. It involves implementing an ImageStack class and an ImageStackLoader function to load the image stacks.
This example will demonstrate how data from an HDF5 file can be loaded for training and prediction.
First, we will save some toy data and create a CAREamics configuration object.
from collections.abc import Sequence
from pathlib import Path
import h5py
import numpy as np
from numpy.typing import DTypeLike, NDArray
from careamics import CAREamist, ImageStackLoading
from careamics.config import create_n2v_config
from careamics.utils.reshape_array import reshape_patch, get_patch_slices, AxesTransform
from careamics.dataset.image_stack.image_utils import pad_patch
DATA_PATH = Path("data")
DATA_PATH.mkdir(exist_ok=True)  # make sure the output directory exists
# --- create toy data
n_files = 5
image_shape = (512, 512)
hdf5_path = DATA_PATH / "dataset.h5"
with h5py.File(hdf5_path, "w") as hdf5_file:
    for i in range(n_files):
        image = np.random.rand(*image_shape)
        data_path = f"image_{i}"
        hdf5_file.create_dataset(name=data_path, data=image)
# HDF5 file with 5 image datasets at the root
# dataset.h5/
# ├── image_0
# ├── image_1
# ├── image_2
# ├── image_3
# └── image_4
# --- configuration
config = create_n2v_config(
    "loading-custom",
    data_type="custom",  # (1)!
    axes="YX",  # (2)!
    patch_size=(64, 64),
    batch_size=16,
    num_epochs=10,
)
- The data_type must be set to "custom".
- The axes of each HDF5 dataset.
Now we will define our custom HDF5ImageStack and a load_hdf5s function. See the Implementing an Image Stack Tutorial for a more in-depth explanation of how to create an image stack class.
To adhere to the ImageStackLoader protocol, the load_hdf5s function MUST have a source argument and an axes argument. The source argument can have any type, and the axes argument MUST be a string that is a subset of "SCTZYX". The return type MUST be a sequence of ImageStack objects. Additional arguments are allowed.
Supervised Algorithms, e.g. CARE
It is up to the loading function to always load images in a deterministic order so that the inputs are matched to their corresponding targets. In our example, h5py will return the group keys in alphanumeric order, so we know that if we provide a target .h5 file with the same structure as our input file, the target images are returned in the same order as our input images.
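One detail worth keeping in mind about name-based ordering: the sketch below assumes the keys are ordered like a plain lexicographic sort of their strings, which is an assumption about the default behaviour rather than something verified here. Under that assumption the order is still deterministic and identical for input and target files with the same structure, but it is not numeric once there are ten or more images. Zero-padding the index is one way to make both orders agree.

```python
# Dataset names as they would be written by the toy-data loop, but with 12 images
keys = [f"image_{i}" for i in range(12)]

# Lexicographic order: "image_10" sorts before "image_2"
ordered = sorted(keys)
print(ordered[:4])  # ['image_0', 'image_1', 'image_10', 'image_11']

# Zero-padded names sort in the same order as their numeric index
padded = sorted(f"image_{i:02d}" for i in range(12))
print(padded[:4])  # ['image_00', 'image_01', 'image_02', 'image_03']
```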
# --- custom ImageStack, that adheres to the ImageStack protocol
# Adapted from the careamics native ZarrImageStack
class HDF5ImageStack:
    def __init__(self, image_data: h5py.Dataset, axes: str):
        self._image_data = image_data
        self.original_axes = axes
        self.original_data_shape = image_data.shape
        self.data_shape = AxesTransform(
            axes, self.original_data_shape
        ).transformed_shape
    @property
    def data_dtype(self) -> DTypeLike:
        return self._image_data.dtype
    @property
    def source(self) -> str:  # (1)!
        return "#".join([self._image_data.file.filename, str(self._image_data.name)])
    def extract_patch(
        self,
        sample_idx: int,
        channels: Sequence[int] | None,
        coords: Sequence[int],
        patch_size: Sequence[int],
    ) -> NDArray:
        """Extract a patch for a given sample and channels within the image stack.
        Parameters
        ----------
        sample_idx : int
            Sample index.
        channels : sequence of int or None
            Channel indices to extract. If `None`, all channels will be extracted.
        coords : sequence of int
            Spatial coordinates of the top-left corner of the patch.
        patch_size : sequence of int
            Size of the patch in each spatial dimension.
        Returns
        -------
        numpy.ndarray
            A patch of the image data from a particular sample with dimensions C(Z)YX.
        """
        patch_slice = get_patch_slices(
            self.original_axes,
            self.original_data_shape,
            sample_idx,
            channels,
            coords,
            patch_size,
        )
        patch_data = self._image_data[patch_slice]  # type: ignore
        patch_data = reshape_patch(patch_data, self.original_axes)
        patch = pad_patch(coords, patch_size, self.data_shape, patch_data)
        return patch
# helper function
def _walk_hdf5(group: h5py.Group):
    """Iterate through every dataset contained in a HDF5 group"""
    keys = group.keys()
    for key in keys:
        node = group.get(key)
        if isinstance(node, h5py.Dataset):
            yield node
        elif isinstance(node, h5py.Group):
            yield from _walk_hdf5(node)
    return
# --- Define the loading function, adhering to the ImageStackLoader protocol
# NOTE: this is just one way to define a HDF5 loader, it could be adapted to:
# - Load from a list of HDF5 files, or
# - Load from a subset HDF5 groups within the file.
def load_hdf5s(source: h5py.File, axes: str) -> list[HDF5ImageStack]:  # (2)!
    """
    Load all the images in a HDF5 file.
    source : h5py.File
        The open HDF5 file.
    axes : str
        Axes order of the data (e.g. "SYX", "YXC").
    """
    image_stacks: list[HDF5ImageStack] = []
    for image_data in _walk_hdf5(source):
        image_stacks.append(HDF5ImageStack(image_data, axes))
    return image_stacks
- The source property is used to track the data, and will be returned alongside the predictions. It should be unique for each image stack.
- Adheres to the ImageStackLoader protocol call signature.
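The call-signature requirement (a source argument and an axes argument, with additional arguments allowed) can be sanity-checked before training. The helper below is a hypothetical convenience written for this guide, not a CAREamics function, and the two loaders are dummies that only exist to exercise it. It checks argument names only; it does not verify that the return value is a sequence of ImageStack objects.

```python
import inspect
from collections.abc import Callable
from typing import Any


def has_loader_signature(func: Callable[..., Any]) -> bool:
    """Return True if `func` has parameters named `source` and `axes`."""
    params = inspect.signature(func).parameters
    return "source" in params and "axes" in params


# Dummy loader with the required arguments plus one additional argument
def load_with_prefix(source: dict[str, Any], axes: str, prefix: str = "") -> list[Any]:
    return [value for name, value in sorted(source.items()) if name.startswith(prefix)]


# Dummy loader whose first argument has the wrong name
def load_wrong_name(path: str, axes: str) -> list[Any]:
    return []


print(has_loader_signature(load_with_prefix))  # True
print(has_loader_signature(load_wrong_name))  # False
```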
Training and prediction are now relatively simple: we pass our loading function to CAREamist.train and CAREamist.predict. The loading function needs to be wrapped in the [ImageStackLoading][careamics.ImageStackLoading] dataclass, where additional arguments to the function can also be included, if required.
careamist = CAREamist(config)
# --- train
hdf5_file = h5py.File(hdf5_path) # keep a reference to the open HDF5 file
careamist.train(
    train_data=hdf5_file,  # (1)!
    loading=ImageStackLoading(load_hdf5s),  # (2)!
)
# --- predict
prediction, sources = careamist.predict(
    pred_data=hdf5_file,
    loading=ImageStackLoading(load_hdf5s),
)
hdf5_file.close() # close the file once done, see h5py docs
sources # inspect the sources of the predictions # (3)!
- The input type corresponds to the source type in our loading function, a h5py.File object.
- Our loading function wrapped in the ImageStackLoading dataclass.
- These will match the format that we defined in HDF5ImageStack.source.
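Because HDF5ImageStack.source joins the file name and the dataset name with a "#", a returned source can be split back into its two parts. The sketch below uses a hypothetical helper, split_source, and an example string that assumes the dataset name is reported as an absolute path within the file (for example "/image_0"); the actual strings depend on how the file was opened.

```python
def split_source(source: str) -> tuple[str, str]:
    """Split a "<file name>#<dataset name>" string into its two parts.

    Splits on the last "#", so a "#" in the file name is tolerated,
    but a "#" in the dataset name is not.
    """
    file_name, _, dataset_name = source.rpartition("#")
    return file_name, dataset_name


# Assumed example of a source string, for illustration only
example_source = "data/dataset.h5#/image_0"
file_name, dataset_name = split_source(example_source)

print(file_name)  # data/dataset.h5
print(dataset_name)  # /image_0
```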