simet.services.roc_auc

RocAucService

Utilities for feature standardization used in ROC-AUC workflows.

standardize_train staticmethod

standardize_train(X)

Fit standardization parameters on X and return the standardized data.

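Computes the per-feature mean and standard deviation across the sample dimension (dim=0) and returns the standardized tensor along with the fitted parameters. The standard deviation comes from torch.std with its default settings, which apply Bessel's correction (the sample standard deviation).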

Parameters:

Name | Type | Description | Default
--- | --- | --- | ---
X | Tensor | Input features of shape (n_samples, n_features). Can be any floating dtype; output matches X.dtype. | required

Returns:

Type | Description
--- | ---
tuple[Tensor, Tensor, Tensor] | The triple (X_std, mu, sigma):

  • X_std: Standardized features, shape (n_samples, n_features).
  • mu: Per-feature mean, shape (1, n_features).
  • sigma: Per-feature std (clipped to >= 1e-6), shape (1, n_features).

Notes
  • Uses sigma = std.clamp_min(1e-6) to avoid division by zero.
  • Statistics are computed along dim=0 with keepdim=True so they broadcast correctly when standardizing.
  • For reproducible pipelines, persist mu and sigma for use on validation/test sets.
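
The behaviour described in these notes can be checked on a small tensor. The sketch below is not the library itself: it copies the two-line body from the source listing into a local function so that it runs without simet installed, and the toy data (including the constant third feature) is made up for illustration.

```python
import torch


def standardize_train(X: torch.Tensor):
    # Local stand-in that mirrors the body of RocAucService.standardize_train.
    mu = X.mean(dim=0, keepdim=True)
    sigma = X.std(dim=0, keepdim=True).clamp_min(1e-6)
    return (X - mu) / sigma, mu, sigma


# Hypothetical data: 4 samples, 3 features; the last feature is constant.
X = torch.tensor(
    [
        [1.0, 10.0, 5.0],
        [2.0, 20.0, 5.0],
        [3.0, 30.0, 5.0],
        [4.0, 40.0, 5.0],
    ]
)

X_std, mu, sigma = standardize_train(X)

# keepdim=True keeps a leading axis of size 1, so mu and sigma broadcast over rows.
print(X_std.shape, mu.shape, sigma.shape)

# The constant feature has zero spread, so its sigma is floored at 1e-6
# and its standardized values are 0 rather than NaN.
print(sigma)
print(X_std[:, 2])
```

The floor matters only for (near-)constant features; for the other columns sigma is the ordinary sample standard deviation.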
Source code in simet/services/roc_auc.py
@staticmethod
def standardize_train(
    X: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Fit standardization parameters on `X` and return the standardized data.

    Computes per-feature mean and standard deviation over the **batch** and
    returns the standardized tensor along with the fitted parameters.

    Args:
        X (torch.Tensor): Input features of shape ``(n_samples, n_features)``.
            Can be any floating dtype; output matches `X.dtype`.

    Returns:
        tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
            - ``X_std``: Standardized features, shape ``(n_samples, n_features)``.
            - ``mu``: Per-feature mean, shape ``(1, n_features)``.
            - ``sigma``: Per-feature std (clipped to >= 1e-6), shape ``(1, n_features)``.

    Notes:
        - Uses ``sigma = std.clamp_min(1e-6)`` to avoid division by zero.
        - Statistics are computed along ``dim=0`` with ``keepdim=True`` so they
          broadcast correctly when standardizing.
        - For reproducible pipelines, persist ``mu`` and ``sigma`` for use on
          validation/test sets.
    """
    mu = X.mean(dim=0, keepdim=True)
    sigma = X.std(dim=0, keepdim=True).clamp_min(1e-6)
    return (X - mu) / sigma, mu, sigma
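
The docstring recommends persisting mu and sigma for later use on validation/test sets. One possible way to do that, sketched under assumptions, is torch.save and torch.load on a small dictionary. The dictionary keys, the random data, and the in-memory buffer (standing in for a file on disk) are illustrative choices, not part of simet.

```python
import io

import torch

torch.manual_seed(0)
X_train = torch.randn(100, 4)  # hypothetical training features
X_val = torch.randn(20, 4)     # hypothetical validation features

# Fit statistics the same way the listing above does.
mu = X_train.mean(dim=0, keepdim=True)
sigma = X_train.std(dim=0, keepdim=True).clamp_min(1e-6)

# Persist both tensors together; a file path could be used instead of the buffer.
buffer = io.BytesIO()
torch.save({"mu": mu, "sigma": sigma}, buffer)

# Later (for example at evaluation time): reload and apply to unseen data.
buffer.seek(0)
stats = torch.load(buffer)
X_val_std = (X_val - stats["mu"]) / stats["sigma"]
```

Saving the two tensors side by side keeps the fitted parameters from drifting apart between training and evaluation runs.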

standardize_with staticmethod

standardize_with(X, mu, sigma)

Standardize X using provided per-feature mean and std.

Parameters:

Name | Type | Description | Default
--- | --- | --- | ---
X | Tensor | Input features, shape (n_samples, n_features). | required
mu | Tensor | Per-feature mean, shape (1, n_features) (or broadcastable). | required
sigma | Tensor | Per-feature std, shape (1, n_features) (or broadcastable). Should be strictly positive; if computed elsewhere, consider clamping. | required

Returns:

Type | Description
--- | ---
Tensor | Standardized features of the same shape/dtype/device as X, provided mu and sigma share X's dtype and device and broadcast against X without enlarging it.

Notes
  • mu and sigma are typically obtained from standardize_train on the training set and reused for validation/test to avoid data leakage.
  • Relies on PyTorch broadcasting; alternative shapes that broadcast (e.g., (n_features,)) are also accepted.
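
The two notes above can be illustrated together: statistics are fitted on the training split only, then reused on a held-out split, first with shape (1, n_features) and then with shape (n_features,). This is a sketch, not the library's code: standardize_with is redefined locally with the same one-line body, and the random splits are hypothetical.

```python
import torch


def standardize_with(X: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor):
    # Local stand-in that mirrors the body of RocAucService.standardize_with.
    return (X - mu) / sigma


torch.manual_seed(0)
X_train = torch.randn(200, 3) * 4.0 + 10.0  # hypothetical training split
X_test = torch.randn(50, 3) * 4.0 + 10.0    # hypothetical held-out split

# Fit on the training split only, so no test statistics leak into the transform.
mu = X_train.mean(dim=0, keepdim=True)
sigma = X_train.std(dim=0, keepdim=True).clamp_min(1e-6)

# Shape (1, n_features) broadcasts across the 50 test rows.
X_test_std = standardize_with(X_test, mu, sigma)

# Shape (n_features,) broadcasts the same way.
X_test_std_1d = standardize_with(X_test, mu.squeeze(0), sigma.squeeze(0))

print(X_test_std.shape, torch.allclose(X_test_std, X_test_std_1d))
```

The standardized test split will generally not have exactly zero mean and unit variance, since the parameters describe the training split; that is expected.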
Source code in simet/services/roc_auc.py
@staticmethod
def standardize_with(
    X: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor
) -> torch.Tensor:
    """Standardize `X` using provided per-feature mean and std.

    Args:
        X (torch.Tensor): Input features, shape ``(n_samples, n_features)``.
        mu (torch.Tensor): Per-feature mean, shape ``(1, n_features)`` (or broadcastable).
        sigma (torch.Tensor): Per-feature std, shape ``(1, n_features)`` (or broadcastable).
            Should be strictly positive; if computed elsewhere, consider clamping.

    Returns:
        torch.Tensor: Standardized features of the same shape/dtype/device as `X`.

    Notes:
        - `mu` and `sigma` are typically obtained from `standardize_train` on the
          training set and reused for validation/test to avoid data leakage.
        - Relies on PyTorch broadcasting; alternative shapes that broadcast
          (e.g., ``(n_features,)``) are also accepted.
    """
    return (X - mu) / sigma
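
Because standardize_with divides by sigma without any safeguard, the advice to clamp an externally computed sigma is worth seeing in action. In the sketch below the function is again a local copy of the one-line body, the data is invented, and the 1e-6 floor is simply borrowed from standardize_train rather than mandated anywhere.

```python
import torch


def standardize_with(X: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor):
    # Local stand-in that mirrors the body of RocAucService.standardize_with.
    return (X - mu) / sigma


# Hypothetical data whose second feature is constant.
X = torch.tensor([[1.0, 7.0], [3.0, 7.0]])
mu = X.mean(dim=0, keepdim=True)

# Statistics computed "elsewhere", without clamping: the second sigma is 0.
sigma_raw = X.std(dim=0, keepdim=True)

# 0 / 0 in the constant column yields NaN.
unsafe = standardize_with(X, mu, sigma_raw)

# Flooring sigma first keeps every entry finite.
safe = standardize_with(X, mu, sigma_raw.clamp_min(1e-6))

print(unsafe)
print(safe)
```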