Datasets and groups#

KonfAI works with grouped datasets. Each case lives in its own directory, and each file in that directory belongs to a named group such as CT, MR, SEG, or MASK.

Expected layout#

Typical layouts in the repository look like this:

Dataset/
├── CASE_001/
│   ├── CT.mha
│   └── SEG.mha
└── CASE_002/
    ├── CT.mha
    └── SEG.mha

Dataset/
├── CASE_001/
│   ├── MR.mha
│   ├── CT.mha
│   └── MASK.mha
└── CASE_002/
    ├── MR.mha
    ├── CT.mha
    └── MASK.mha

The concrete file extension is not restricted to .mha. KonfAI supports the extensions listed in konfai.utils.utils.SUPPORTED_EXTENSIONS.

Directory-backed formats use the same case/group model:

DicomDataset/CASE_001/CT/*.dcm
OmeDataset/CASE_001/CT.ome.zarr/
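
The case/group model above can be checked with a few lines of standard-library Python. This is a minimal sketch, not KonfAI's own loader: `scan_dataset` and `common_groups` are hypothetical helpers. The rule that the group name is the entry name up to its first dot is an assumption inferred from the layouts shown here.

```python
from pathlib import Path


def scan_dataset(root):
    """Map each case directory to the group entries found inside it.

    Assumption: the group name is the entry name up to its first dot, so
    CT.mha, CT.ome.zarr/ and a CT/ folder of DICOM slices all map to "CT".
    """
    cases = {}
    for case_dir in sorted(Path(root).iterdir()):
        if not case_dir.is_dir():
            continue  # ignore stray files next to the case directories
        groups = {}
        for entry in sorted(case_dir.iterdir()):
            group_name = entry.name.split(".", 1)[0]
            groups[group_name] = entry
        cases[case_dir.name] = groups
    return cases


def common_groups(cases):
    """Return the group names that are present in every case."""
    group_sets = [set(groups) for groups in cases.values()]
    return set.intersection(*group_sets) if group_sets else set()
```

Calling `common_groups(scan_dataset("./Dataset"))` before writing a configuration is a quick way to see which groups every case actually provides.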

Two layers: format readers and dataset loaders#

Data handling is split across two packages with distinct responsibilities.

konfai/utils/: format readers. These modules turn an on-disk file into a channel-first array plus its physical geometry, and they are the only place that knows about file formats. The per-backend table (format tokens, optional extras, API details) lives in Storage backends & formats.

konfai/data/: PyTorch datasets and dataloaders. These modules build the torch.utils.data.Dataset / DataLoader machinery on top of the readers:

| Module | Role |
| --- | --- |
| `konfai/data/data_manager.py` | grouped `Data*` datasets, `GroupTransform`, subset/validation splitting |
| `konfai/data/augmentation.py` | `DataAugmentationsList`: on-the-fly augmentation |
| `konfai/data/patching.py` | `DatasetPatch`: patch extraction and reassembly |

Patch-native execution from storage#

KonfAI’s data engine plans work around patches rather than assuming an entire case is the unit of execution. DICOM, OME-Zarr, HDF5, and supported SimpleITK paths expose regional entry points, allowing a compatible pipeline to read the source window needed for the current patch.

When a transform cannot preserve this regional contract, KonfAI keeps the same workflow and selects a bounded full-volume buffer instead of producing an incorrect patch. This fallback is a safety property; it does not change the patch-native interface seen by training or prediction. See Processing large images for operational tuning and Patch streaming for the spatial-planner rules.
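
The idea that both routes must produce the same patch can be illustrated with a toy reader. This is only a sketch under stated assumptions: `InMemoryReader`, `load_patch`, and the `regional_ok` flag are hypothetical names, the image is a plain 2D list, and the memory bound on the fallback buffer is not modelled.

```python
class InMemoryReader:
    """Hypothetical reader over a 2D image stored as a list of rows."""

    def __init__(self, image):
        self.image = image

    def read_region(self, start, size):
        # Regional entry point: touch only the rows and columns requested.
        (r0, c0), (h, w) = start, size
        return [row[c0:c0 + w] for row in self.image[r0:r0 + h]]

    def read_full(self):
        # Full-volume entry point: return a copy of the whole image.
        return [list(row) for row in self.image]


def load_patch(reader, start, size, transform=None, regional_ok=True):
    """Return the transformed patch, reading a region when that is safe."""
    transform = transform or (lambda rows: rows)
    if regional_ok:
        return transform(reader.read_region(start, size))
    # Fallback: transform the whole image, then crop the same window.
    full = transform(reader.read_full())
    (r0, c0), (h, w) = start, size
    return [row[c0:c0 + w] for row in full[r0:r0 + h]]


image = [[r * 4 + c for c in range(4)] for r in range(4)]
reader = InMemoryReader(image)


def double(rows):
    # A per-pixel transform, which is safe to apply to a region alone.
    return [[2 * value for value in row] for row in rows]


regional = load_patch(reader, (1, 1), (2, 2), double, regional_ok=True)
fallback = load_patch(reader, (1, 1), (2, 2), double, regional_ok=False)
```

For a per-pixel transform the two results are identical. A transform that depends on the whole image, such as normalising by a global statistic, would give a wrong answer on the regional route, which is the case the fallback exists for.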

The `Attribute` class#

Reading a medical image is not just reading pixels — the physical geometry must travel with the array so predictions can be written back into the same space. konfai.utils.dataset.Attribute is the container that carries it.

Attribute is a dict[str, Any] subclass that stores, among other metadata, the three values that define an image in physical space:

Origin — physical position of the first voxel
Spacing — voxel size along each axis
Direction — the flattened direction-cosine matrix

Numeric values are stored as strings and recovered with get_np_array(key) or get_tensor(key). Keys use a stack-like naming scheme (Origin_0, Origin_1, …) so that a chain of transforms can push successive geometries and later pop them to invert the chain. This is how KonfAI restores the original geometry when exporting a prediction.
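
The stack-like key scheme can be pictured with a small stand-in class. This is an illustrative sketch, not the Attribute implementation: `GeometryStack` and its `push`, `peek`, and `pop` methods are hypothetical, and the space-separated string encoding is an assumption chosen for the example.

```python
class GeometryStack(dict):
    """Hypothetical string-valued metadata dict with stacked keys."""

    def _depth(self, key):
        # Count the entries already stored as key_0, key_1, ...
        depth = 0
        while f"{key}_{depth}" in self:
            depth += 1
        return depth

    def push(self, key, values):
        # Store the numbers as one space-separated string (assumed encoding).
        encoded = " ".join(repr(float(v)) for v in values)
        self[f"{key}_{self._depth(key)}"] = encoded

    def peek(self, key):
        # Decode the most recent entry back into floats.
        depth = self._depth(key)
        if depth == 0:
            raise KeyError(key)
        return [float(v) for v in self[f"{key}_{depth - 1}"].split()]

    def pop(self, key):
        # Remove and return the most recent entry.
        values = self.peek(key)
        del self[f"{key}_{self._depth(key) - 1}"]
        return values


attrs = GeometryStack()
attrs.push("Spacing", [1.0, 1.0, 2.5])  # geometry as read from disk
attrs.push("Spacing", [2.0, 2.0, 2.0])  # geometry after a resampling step
current = attrs.pop("Spacing")          # undo the resampling step
original = attrs.peek("Spacing")        # geometry to export with
```

After the pop, only `Spacing_0` remains, so an exporter that reads the top of the stack writes the prediction back with the geometry of the source image.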

`groups_src` and `groups_dest`#

Each workflow describes how on-disk groups should be loaded through the Dataset.groups_src mapping.

Example:

Dataset:
  groups_src:
    CT:
      groups_dest:
        CT:
          transforms:
            Standardize:
              lazy: false
              mean: None
              std: None
              mask: None
              inverse: false
          is_input: true

Conceptually:

groups_src identifies what must exist on disk
groups_dest identifies how the loaded tensors are exposed to the workflow
is_input: true marks tensors that are fed into the model

The logic lives in konfai.data.data_manager.GroupTransform and the Data* dataset classes.
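
To make the source/destination relationship concrete, the sketch below walks a configuration that has already been loaded into a Python dict. It is not how GroupTransform works internally. `list_destinations` is a hypothetical helper, the SEG entry is added for illustration, and treating a missing is_input as false is an assumption.

```python
config = {
    "Dataset": {
        "groups_src": {
            "CT": {
                "groups_dest": {
                    "CT": {
                        "transforms": {"Standardize": {"lazy": False}},
                        "is_input": True,
                    },
                }
            },
            # Hypothetical second group, added for illustration only.
            "SEG": {
                "groups_dest": {
                    "SEG": {"transforms": {}, "is_input": False},
                }
            },
        }
    }
}


def list_destinations(config):
    """Yield (source group, destination group, is_input) for each mapping."""
    for src_name, src in config["Dataset"]["groups_src"].items():
        for dest_name, dest in src["groups_dest"].items():
            # Assumption: a destination without is_input is not a model input.
            yield src_name, dest_name, bool(dest.get("is_input", False))


# Groups that must exist on disk for every case.
required_on_disk = sorted(config["Dataset"]["groups_src"])
# Destination tensors that are fed into the model.
model_inputs = [dest for _, dest, is_input in list_destinations(config) if is_input]
```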

Dataset file selectors#

The dataset_filenames field accepts strings in the form:

path
path:format
path:flag:format

This behavior is implemented in konfai.data.data_manager.Data.get_data().

The most important conventions are:

a means “append / union”
i means “intersection / keep only common cases”

Examples:

./Dataset:a:mha
./Predictions/TRAIN_01/Dataset:i:mha
./DicomDataset:a:dicom
./OmeDataset:a:omezarr
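
The three selector forms and the two flags can be sketched as follows. This is a hypothetical parser written for illustration, not the code in Data.get_data(). It returns None for omitted parts instead of guessing defaults, it assumes paths without a colon (so Windows drive letters are not handled), and folding the sources in listed order is an assumed reading of "union" and "intersection".

```python
def parse_selector(selector):
    """Split a selector into (path, flag, format); omitted parts are None."""
    parts = selector.split(":")
    if len(parts) == 1:
        return parts[0], None, None
    if len(parts) == 2:
        return parts[0], None, parts[1]
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    raise ValueError(f"unexpected selector: {selector!r}")


def combine_cases(sources):
    """Fold (flag, case names) pairs in order: 'a' unions, 'i' intersects."""
    selected = set()
    for flag, names in sources:
        if flag == "a":
            selected |= set(names)
        elif flag == "i":
            selected &= set(names)
        else:
            raise ValueError(f"unknown flag: {flag!r}")
    return selected
```

Under these assumptions, listing a dataset with `a` and a prediction directory with `i` keeps only the cases that appear in both, which matches the pairing of the first two examples above.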

Training subsets and validation#

KonfAI supports several ways to define subsets and validation sets.

Based on the dataset code, subset accepts any of the following:

None
a slice string such as 0:10
a path to a text file listing case names
a ~path.txt exclusion file
a list of indices
a list of case names
a list of case-list files

Based on the dataset code, validation accepts any of the following:

None
a float such as 0.2
a slice string such as 0:10
a path to a text file listing case names
a list of indices
a list of case names
a list mixing case names and case-list files

Three semantics are worth remembering:

subset: None keeps the full dataset;
validation: None disables the split;
~ exclusion applies to subset but not to validation.

The subset object is applied before validation splitting and can exclude or include items. The exact logic is implemented by TrainSubset and PredictionSubset.
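
The order of operations (subset first, then validation) can be sketched with two small functions. These are hypothetical helpers, not TrainSubset or PredictionSubset. The sketch covers None, slice strings, case-list files, `~` exclusion files, index lists, and name lists, but not lists of case-list files. Holding out the trailing fraction for validation is an arbitrary choice made for the example, and the real split strategy may differ.

```python
import re
from pathlib import Path

SLICE_PATTERN = re.compile(r"^(-?\d*):(-?\d*)$")


def read_case_list(path):
    """Read one case name per line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def resolve_subset(names, spec):
    """Return the case names kept by a subset specification."""
    if spec is None:
        return list(names)  # keep the full dataset
    if isinstance(spec, str):
        match = SLICE_PATTERN.match(spec)
        if match:  # slice string such as "0:10"
            start, stop = (int(g) if g else None for g in match.groups())
            return list(names[start:stop])
        if spec.startswith("~"):  # exclusion file
            excluded = set(read_case_list(spec[1:]))
            return [n for n in names if n not in excluded]
        listed = set(read_case_list(spec))  # inclusion file
        return [n for n in names if n in listed]
    if all(isinstance(item, int) for item in spec):
        return [names[i] for i in spec]  # list of indices
    wanted = set(spec)  # list of case names
    return [n for n in names if n in wanted]


def split_validation(names, fraction):
    """Hold out the trailing fraction of cases (sketch-only strategy)."""
    n_val = round(len(names) * fraction)
    cut = len(names) - n_val
    return names[:cut], names[cut:]
```

Calling `split_validation(resolve_subset(names, spec), 0.2)` mirrors the documented order: the validation split only ever sees the cases that the subset kept.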

Caching, augmentation, and patching#

At the dataset level, KonfAI can:

cache transformed data in memory
generate multiple augmentations per item
split volumes into patches before they reach the model

This is handled by:

konfai.data.data_manager.DataTrain
konfai.data.augmentation.DataAugmentationsList
konfai.data.patching.DatasetPatch

When to use dataset patching#

Use Dataset.Patch when:

volumes are too large to process at once
you want 2D, 2.5D, or 3D crops sampled from larger volumes
you need sliding-window style training or inference

Dataset patching is separate from model patching, which applies inside the network itself. See Model graph and output naming.
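
Sliding-window patching comes down to choosing start offsets along each axis. The sketch below is a generic illustration of that idea, not the DatasetPatch implementation. `patch_starts` and `patch_grid` are hypothetical helpers, and moving the last patch back so that it ends at the border (instead of padding) is an assumption made for the example.

```python
from itertools import product


def patch_starts(size, patch, overlap=0):
    """Start offsets along one axis so that the patches cover [0, size)."""
    if patch >= size:
        return [0]  # one patch spans the whole axis (padding is left out here)
    stride = patch - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the patch size")
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] + patch < size:
        starts.append(size - patch)  # shift the last patch back to the border
    return starts


def patch_grid(shape, patch_size, overlap=0):
    """Start corners of every patch for an N-dimensional volume."""
    axes = [patch_starts(s, p, overlap) for s, p in zip(shape, patch_size)]
    return list(product(*axes))
```

A patch size of `(1, 4, 4)` on a 3D volume yields 2D slices, while `(4, 4, 4)` yields 3D crops, so the same helper covers both cases from the list above.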

Next steps#

Model graph and output naming — to attach losses and metrics that target these groups
Storage backends & formats — format tokens, optional extras, and the DICOM / OME-Zarr APIs
Processing large images — regional execution, bounded fallback, and tuning
Training configuration — the Dataset keys in the context of a full training config