simet.services.feature_cache

FeatureCacheService

FeatureCacheService(cache_dir=Path('cache/features'))

Disk cache for feature matrices computed from DataLoaders.

Stores/loads precomputed feature arrays (e.g., (N, D) floats) keyed by a deterministic hash derived from:

  • dataset identity (root path or dataset type/length),
  • subset membership (sorted indices for torch.utils.data.Subset),
  • feature-extractor suffix (e.g., "inception_v3"),
  • loader parameters (e.g., batch_size).

Features are serialized via pickle under cache_dir/<md5>.pkl.
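For orientation, here is a minimal sketch of how such a key could be assembled. The helper name `sketch_cache_key` and its exact field choices are illustrative assumptions, not the actual `_generate_cache_key` implementation:

```python
import hashlib

from torch.utils.data import DataLoader, Subset


def sketch_cache_key(loader: DataLoader, cache_key_suffix: str = "") -> str:
    """Hypothetical key builder mirroring the documented components."""
    dataset = loader.dataset
    parts: list[str] = []

    if isinstance(dataset, Subset):
        # Subset membership: hash the sorted indices so two different
        # subsets of the same dataset map to different cache files.
        idx_blob = ",".join(map(str, sorted(dataset.indices))).encode()
        parts.append(type(dataset.dataset).__name__)
        parts.append(hashlib.md5(idx_blob).hexdigest())
    else:
        # Dataset identity: root path when available, else type/length
        # (assumes a map-style dataset that implements __len__).
        root = getattr(dataset, "root", None)
        parts.append(str(root) if root else f"{type(dataset).__name__}:{len(dataset)}")

    parts.append(cache_key_suffix)        # feature-extractor suffix
    parts.append(str(loader.batch_size))  # loader parameters

    return hashlib.md5("|".join(parts).encode()).hexdigest()
```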

Parameters:

| Name        | Type        | Description                                                                           | Default                  |
| ----------- | ----------- | ------------------------------------------------------------------------------------- | ------------------------ |
| `cache_dir` | Path \| str | Directory where cached feature files are written/read. Created if it does not exist. | `Path('cache/features')` |
Example

```python
>>> svc = FeatureCacheService("cache/features")
>>> feats = svc.get_or_compute(
...     loader=my_loader,
...     compute_fn=my_extractor_fn,  # def f(loader) -> np.ndarray
...     cache_key_suffix="inception_v3",
... )
>>> feats.shape  # doctest: +SKIP
(N, D)
```

Notes
  • Security: pickle is not safe for untrusted inputs. Only load files created by this application.
  • Subsets: For Subset datasets, the cache key includes a hash of the sorted indices, so different subsets cache to different files (illustrated in the sketch after these notes).
  • Invalidation: Changing any key component (dataset path/size, suffix, batch size) produces a different cache key and thus a cache miss.
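As an illustration of the subset and invalidation behavior, a hedged usage sketch; `FakeData` and `my_extractor_fn` below are placeholders, not part of simet:

```python
import numpy as np
from torch.utils.data import DataLoader, Subset
from torchvision.datasets import FakeData
from torchvision.transforms import ToTensor


def my_extractor_fn(loader: DataLoader) -> np.ndarray:
    # Placeholder extractor: flatten each image batch to (B, D) and stack.
    return np.concatenate([x.reshape(len(x), -1).numpy() for x, _ in loader])


dataset = FakeData(size=1000, image_size=(3, 8, 8), transform=ToTensor())

# Two disjoint subsets of the same dataset: because the key hashes the
# sorted subset indices, each call below writes its own cache file.
train_loader = DataLoader(Subset(dataset, range(0, 800)), batch_size=32)
val_loader = DataLoader(Subset(dataset, range(800, 1000)), batch_size=32)

svc = FeatureCacheService("cache/features")
train_feats = svc.get_or_compute(train_loader, my_extractor_fn, "inception_v3")
val_feats = svc.get_or_compute(val_loader, my_extractor_fn, "inception_v3")

# Changing any key component (here: batch_size) is a cache miss.
other_loader = DataLoader(Subset(dataset, range(0, 800)), batch_size=64)
_ = svc.get_or_compute(other_loader, my_extractor_fn, "inception_v3")
```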

Initialize the cache directory and logger.

Parameters:

| Name        | Type        | Description                                | Default                  |
| ----------- | ----------- | ------------------------------------------ | ------------------------ |
| `cache_dir` | Path \| str | Directory to store cache files (`*.pkl`). | `Path('cache/features')` |
Source code in simet/services/feature_cache.py
```python
def __init__(self, cache_dir: Path | str = Path("cache/features")) -> None:
    """Initialize the cache directory and logger.

    Args:
        cache_dir: Directory to store cache files (`*.pkl`).
    """
    logger.debug(f"Initializing FeatureCacheService with cache_dir: {cache_dir}")
    # Coerce str paths, matching the documented `Path | str` type and the
    # class example, which passes a plain string.
    self.cache_dir = Path(cache_dir)
    self.cache_dir.mkdir(parents=True, exist_ok=True)
```

get_or_compute

get_or_compute(loader, compute_fn, cache_key_suffix='', force_recompute=False)

Return cached features for loader or compute and cache them.

Builds a stable cache key (see _generate_cache_key) and attempts to load the feature array. If missing or force_recompute=True, runs compute_fn(loader), saves the result, and returns it.

Parameters:

| Name               | Type                                           | Description                                                                                                  | Default    |
| ------------------ | ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------ | ---------- |
| `loader`           | DataLoader[VisionDataset]                      | DataLoader providing samples for feature extraction.                                                          | *required* |
| `compute_fn`       | Callable[[DataLoader[VisionDataset]], ndarray] | Callable that computes the feature matrix from `loader` and returns a NumPy array, typically shape `(N, D)`.  | *required* |
| `cache_key_suffix` | str                                            | Disambiguator for different extractors/configs (e.g., `"inception_v3"`, `"resnet50_pool5"`).                  | `''`       |
| `force_recompute`  | bool                                           | If True, bypass the cache, recompute, and overwrite the cache entry.                                          | `False`    |

Returns:

| Type    | Description                                     |
| ------- | ----------------------------------------------- |
| ndarray | The feature matrix for all samples in `loader`. |

Logging

Emits INFO on cache hits/misses and DEBUG with the resolved cache path.
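A hedged usage sketch of the hit, miss, and recompute paths, reusing the placeholder names (`svc`, `train_loader`, `my_extractor_fn`) from the earlier example; the exact log lines depend on your logger configuration:

```python
# First call: cache miss -> compute_fn runs and the result is pickled.
feats = svc.get_or_compute(train_loader, my_extractor_fn, "inception_v3")

# Second call with identical key components: cache hit, compute_fn is skipped.
feats_again = svc.get_or_compute(train_loader, my_extractor_fn, "inception_v3")

# Bypass the cache explicitly, e.g. after changing extractor weights without
# changing the suffix; the existing cache entry is overwritten.
fresh = svc.get_or_compute(
    train_loader, my_extractor_fn, "inception_v3", force_recompute=True
)
```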

Source code in simet/services/feature_cache.py
```python
def get_or_compute(
    self,
    loader: DataLoader[VisionDataset],
    compute_fn: Callable[[DataLoader[VisionDataset]], np.ndarray],
    cache_key_suffix: str = "",
    force_recompute: bool = False,
) -> np.ndarray:
    """Return cached features for `loader` or compute and cache them.

    Builds a stable cache key (see `_generate_cache_key`) and attempts to
    load the feature array. If missing or `force_recompute=True`, runs
    `compute_fn(loader)`, saves the result, and returns it.

    Args:
        loader: DataLoader providing samples for feature extraction.
        compute_fn: Callable that computes the feature matrix from `loader`
            and returns a NumPy array, typically shape `(N, D)`.
        cache_key_suffix: Disambiguator for different extractors/configs
            (e.g., `"inception_v3"`, `"resnet50_pool5"`).
        force_recompute: If True, bypass cache and recompute, then overwrite
            the cache entry.

    Returns:
        np.ndarray: The feature matrix for all samples in `loader`.

    Logging:
        Emits INFO on cache hits/misses and DEBUG with the resolved cache path.
    """
    cache_key = self._generate_cache_key(loader, cache_key_suffix)
    logger.debug(
        f"Generated cache key: {cache_key} for loader with dataset: {type(loader.dataset).__name__}"
    )
    cache_path = self._get_cache_path(cache_key)
    logger.debug(f"Cache path resolved to: {cache_path}")

    if not force_recompute:
        cached_features = self._load_from_cache(cache_path)
        if cached_features is not None:
            logger.info(f"Loaded features from cache: {cache_path}")
            return cached_features

    if force_recompute:
        logger.info(
            f"Force recompute requested, computing features and saving to {cache_path}"
        )
    else:
        logger.info(f"Cache miss, computing features and saving to {cache_path}")

    features = compute_fn(loader)
    self._save_to_cache(features, cache_path)
    return features
```
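The private helpers called above are not shown on this page. Below is a minimal sketch of what they could look like, assuming the plain pickle round-trip described in the class docstring; the real implementations may differ:

```python
import pickle
from pathlib import Path

import numpy as np


class _FeatureCacheHelpersSketch:
    """Hypothetical stand-ins for the private helpers used above."""

    cache_dir: Path

    def _get_cache_path(self, cache_key: str) -> Path:
        # One file per MD5 key, as documented: cache_dir/<md5>.pkl
        return self.cache_dir / f"{cache_key}.pkl"

    def _load_from_cache(self, cache_path: Path) -> np.ndarray | None:
        # Return None on a miss so get_or_compute falls through to compute_fn.
        if not cache_path.exists():
            return None
        with cache_path.open("rb") as f:
            return pickle.load(f)  # trusted, application-written files only

    def _save_to_cache(self, features: np.ndarray, cache_path: Path) -> None:
        with cache_path.open("wb") as f:
            pickle.dump(features, f)
```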