Updated regression sampler by AbasKhan · Pull Request #251 · Modalities/ml_filter

AbasKhan · 2025-12-11T00:46:18Z

This PR introduces a configurable uniform split sampler for JSONL datasets, along with its configuration and tests. 🎯

Adds UniformSplitSampler, which:

📥 Ingests one or more input JSONL files.
🔢 Requires an integer score field per example and normalizes these scores.
✂️ Produces balanced train/validation JSONL outputs according to the configured split ratio.
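
The ingestion and score-normalization steps above could look roughly like the following. This is a minimal stdlib sketch, not the PR's code: the helper names `normalize_score` and `load_examples`, the default field name `score`, and the reading of "normalizes" as "coerce integral values such as 4.0 or "4" to int and reject everything else" are all assumptions made for illustration.

```python
import json
from io import StringIO
from typing import Any, Iterable, TextIO


def normalize_score(value: Any) -> int:
    """Coerce a raw score to int; reject anything that is not integral (assumed rule)."""
    if isinstance(value, bool):  # bool is a subclass of int, so exclude it first
        raise ValueError(f"score must be an integer, got {value!r}")
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    if isinstance(value, str):
        return normalize_score(float(value))  # float() raises ValueError on junk
    raise ValueError(f"score must be an integer, got {value!r}")


def load_examples(streams: Iterable[TextIO], score_field: str = "score") -> list[dict]:
    """Read JSONL rows from every stream and normalize the score field in place."""
    rows = []
    for stream in streams:
        for line in stream:
            if not line.strip():
                continue  # tolerate blank lines between records
            row = json.loads(line)
            if score_field not in row:
                raise KeyError(f"missing required field {score_field!r}")
            row[score_field] = normalize_score(row[score_field])
            rows.append(row)
    return rows


# Hypothetical input: two records, one with a float score and one with a string score.
demo = StringIO('{"text": "a", "score": 3.0}\n{"text": "b", "score": "4"}\n')
rows = load_examples([demo])
```

Failing fast on a non-integral score keeps the later grouping step simple, because every label is then guaranteed to be a plain int.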

_build_splits

🧩 Groups examples by label and computes per-label quotas for train/validation.
🧷 Uses split_label_pools to partition each label’s pool into train/val.
📈 Applies sampling with a max_oversampling_ratio cap so that rare labels are upsampled without any label's pool being duplicated beyond the configured limit.
🔀 Shuffles the resulting splits to avoid ordering artifacts.
📊 Logs overall and per-label distributions for transparency and easier debugging.
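
The steps listed above can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: it uses plain lists instead of the DataFrames the real utilities appear to use, `sample_with_cap`, `build_splits` and `label_counts` are hypothetical names, the cap is assumed to be `int(len(pool) * max_oversampling_ratio)`, and the inline slice stands in for whatever split_label_pools actually does.

```python
import random
from collections import Counter, defaultdict


def sample_with_cap(pool, target, max_oversampling_ratio, rng):
    """Draw up to `target` items, never more than len(pool) * ratio (assumed cap)."""
    if not pool or target <= 0:
        return []
    max_allowed = int(len(pool) * max_oversampling_ratio)
    n = max(0, min(target, max_allowed))
    if n <= len(pool):
        return rng.sample(pool, n)  # enough data: sample without replacement
    # Rare label: keep every example once, then add duplicates up to the cap.
    return list(pool) + rng.choices(pool, k=n - len(pool))


def build_splits(examples, per_label_target, val_fraction=0.2,
                 max_oversampling_ratio=2.0, label_field="score", seed=0):
    rng = random.Random(seed)  # a single seeded RNG keeps the result reproducible
    pools = defaultdict(list)
    for ex in examples:
        pools[ex[label_field]].append(ex)

    # Per-label quotas for train and validation.
    train_quota = round(per_label_target * (1 - val_fraction))
    val_quota = per_label_target - train_quota

    train, val = [], []
    for label in sorted(pools):  # sorted for a deterministic iteration order
        pool = pools[label]
        rng.shuffle(pool)
        n_val = round(len(pool) * val_fraction)
        # Partition first, so oversampling never copies an example across splits.
        val_pool, train_pool = pool[:n_val], pool[n_val:]
        train += sample_with_cap(train_pool, train_quota, max_oversampling_ratio, rng)
        val += sample_with_cap(val_pool, val_quota, max_oversampling_ratio, rng)

    rng.shuffle(train)  # avoid label-ordered output
    rng.shuffle(val)
    return train, val


def label_counts(rows, label_field="score"):
    """Per-label distribution, suitable for logging."""
    return dict(Counter(r[label_field] for r in rows))
```

With 10 examples of label 0, 3 examples of label 1, `per_label_target=10` and the defaults above, the train split holds 8 examples of label 0 and 4 of label 1: the rare label's train pool of 2 is doubled to the cap of 4 rather than stretched to the full quota of 8.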

In short: this sampler helps you build balanced, reproducible, and well-logged train/val splits from JSONL.


src/ml_filter/sampling/uniform_split_sampler.py

src/ml_filter/utils/uniform_split_sampler_utils.py


ajude2s · 2025-12-12T10:00:51Z

src/ml_filter/utils/uniform_split_sampler_utils.py

+    if max_allowed <= 0:
+        return pool.head(0).copy()


Redundant. There is already a check above for pool.empty or target <= 0.
max_allowed will be less than or equal to 0 only if max_oversampling_ratio is less than 0.

Which would never be the case, right?
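
Whether the guard is reachable depends on how max_allowed is derived, which this excerpt does not show. As a hedged sketch, assuming a hypothetical cap of `floor(pool_size * max_oversampling_ratio)`, the helper below shows which inputs give a non-positive cap, and how validating the ratio once at configuration time would make the guard provably dead. Both function names are illustrative and not taken from the PR.

```python
import math


def max_allowed_for(pool_size: int, max_oversampling_ratio: float) -> int:
    """Hypothetical cap: the largest sample size permitted for a pool of this size."""
    return math.floor(pool_size * max_oversampling_ratio)


def validate_ratio(max_oversampling_ratio: float) -> float:
    """Reject ratios below 1.0 so a non-empty pool always has a positive cap."""
    if max_oversampling_ratio < 1.0:
        raise ValueError(
            f"max_oversampling_ratio must be >= 1.0, got {max_oversampling_ratio}"
        )
    return max_oversampling_ratio


# Under the assumed formula, a non-empty pool yields a non-positive cap when the
# ratio is negative, and also when a small pool meets a ratio below 1.
negative_ratio_cap = max_allowed_for(5, -1.0)
small_pool_cap = max_allowed_for(1, 0.5)
normal_cap = max_allowed_for(5, 2.0)
```

If the ratio is validated like this where the configuration is loaded, the `max_allowed <= 0` branch can be dropped safely; without such validation, keeping the guard is the cheaper option.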

ajude2s

Awesome work, Abbas. 👍
I have added minor changes and some suggestions.

Also, should we not add the original sampler (the fixed distribution which yields "best" performance) as well?


AbasKhan added 2 commits December 11, 2025 01:39

feat: implement uniform split sampler with capped oversampling and configuration support

2b75b0a

feat: add uniform split sampler CLI entry point and tests

1f642ba

AbasKhan requested a review from ajude2s December 11, 2025 00:47

ajude2s reviewed Dec 11, 2025

src/ml_filter/sampling/uniform_split_sampler.py (outdated)

ajude2s reviewed Dec 11, 2025

src/ml_filter/utils/uniform_split_sampler_utils.py (outdated)