simet.services.subsampling
SubsamplingService
Utilities to size-match two datasets by random subsampling.
The main entry point subsample(...) compares dataset sizes from two
providers (real vs. synthetic) and, if their size ratio exceeds a
tolerance, returns wrapped providers where the larger dataset is
replaced by a Subset of the smaller size. The returned providers are
SubsampledProvider instances that expose the subsampled dataset.
Notes
- Randomness is driven by Python’s random module. For reproducibility, seed it beforehand (e.g., via SeedingService.set_global_seed(seed)).
- The transform passed to subsample is used to build datasets for measuring lengths and constructing the subsampled dataset. The final wrapped provider (a SubsampledProvider) already contains the subsampled dataset and ignores transforms later.
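The reproducibility note can be illustrated with the standard library alone. The sketch below does not call SeedingService (its internals are not shown here); it seeds Python's random module directly, and `pick_indices` is a hypothetical helper introduced only for this illustration.

```python
import random


def pick_indices(n_total: int, n_keep: int) -> list[int]:
    """Hypothetical helper: draw n_keep distinct indices from range(n_total)."""
    return sorted(random.sample(range(n_total), n_keep))


# Seeding before each draw makes the selected indices repeatable.
random.seed(1234)
first = pick_indices(10_000, 5)

random.seed(1234)
second = pick_indices(10_000, 5)

print(first == second)  # True: same seed, same subsample
```

Without the second `random.seed(1234)` call the two draws would generally differ, which is why seeding should happen before subsample is invoked.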
subsample (staticmethod)
subsample(real_provider, synth_provider, provider_transform, acceptable_ratio=1.1)
Return providers sized to within acceptable_ratio by subsampling.
Compares the lengths of datasets produced by real_provider and
synth_provider (using provider_transform). If the size ratio
max(n_real/n_synth, n_synth/n_real) is greater than
acceptable_ratio, randomly subsamples the larger dataset down to
the size of the smaller one and returns new providers that wrap those
subsampled datasets. Otherwise, returns the original providers.
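The decision rule described above can be sketched on plain sequences. This is a minimal illustration under stated assumptions, not the library's implementation: `size_ratio` and `match_sizes` are hypothetical names, the inputs are lists rather than Provider objects, and the real method returns SubsampledProvider instances instead of lists.

```python
import random
from typing import Sequence


def size_ratio(n_real: int, n_synth: int) -> float:
    """Symmetric imbalance: max(n_real/n_synth, n_synth/n_real)."""
    if n_real == 0 or n_synth == 0:
        # Mirrors the documented ValueError for empty datasets.
        raise ValueError("both datasets must be non-empty")
    return max(n_real / n_synth, n_synth / n_real)


def match_sizes(real: Sequence, synth: Sequence, acceptable_ratio: float = 1.1):
    """Hypothetical stand-in: subsample the larger sequence if the ratio is too high."""
    if size_ratio(len(real), len(synth)) <= acceptable_ratio:
        return list(real), list(synth)  # within tolerance: unchanged
    n_small = min(len(real), len(synth))
    if len(real) > len(synth):
        idx = random.sample(range(len(real)), n_small)
        return [real[i] for i in idx], list(synth)
    idx = random.sample(range(len(synth)), n_small)
    return list(real), [synth[i] for i in idx]


real = list(range(10_000))
synth = list(range(5_000))
real2, synth2 = match_sizes(real, synth, acceptable_ratio=1.1)
print(len(real2), len(synth2))  # 5000 5000
```

The ratio is symmetric, so the same tolerance applies whichever dataset is larger; only the larger side is ever reduced.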
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| real_provider | Provider | Provider for the real dataset. | required |
| synth_provider | Provider | Provider for the synthetic dataset. | required |
| provider_transform | Transform | Transform used to build datasets for size computation and for creating the subsampled dataset. | required |
| acceptable_ratio | float | Maximum tolerated size imbalance. | 1.1 |
Returns:
| Type | Description |
|---|---|
| tuple[Provider, Provider] | Potentially updated providers. If subsampling occurs, the larger dataset is replaced by a Subset of the smaller size and the returned providers are SubsampledProvider instances. |
Raises:
| Type | Description |
|---|---|
| ValueError | If either dataset is empty. |
Example
If real has 10_000 samples and synth has 5_000, then ratio = 10_000/5_000 = 2.0 > 1.1, so real is subsampled to 5_000:

    >>> real_p2, synth_p2 = SubsamplingService.subsample(
    ...     real_provider, synth_provider, provider_transform, acceptable_ratio=1.1
    ... )
Source code in simet/services/subsampling.py