Add report generation strategy for the MegatronRun #787
Conversation
📝 Walkthrough: Adds MegatronRunReportGenerationStrategy: parses Megatron-Run stdout to extract per-iteration times and per-GPU TFLOP/s, computes statistics over recent iterations, writes a CSV report, exports and registers the strategy, and adds unit tests.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ 2 passed
Greptile Summary: Adds MegatronRunReportGenerationStrategy for the MegatronRun workload.
Confidence Score: 5/5
amaslenn left a comment:
We are in the code freeze stage now, this PR will be merged later.
src/cloudai/workloads/megatron_run/report_generation_strategy.py (outdated review thread, resolved)
Actionable comments posted: 2
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/workloads/megatron_run/report_generation_strategy.py (1)
30-121: Iteration-time parsing fails when TFLOP/s isn’t logged.
`ITERATION_REGEX` requires TFLOP/s, so logs that only print iteration time will be ignored; `can_handle_directory()` then returns False and `get_metric("iteration-time")` returns METRIC_ERROR despite valid data. Consider making TFLOP/s optional and skipping warmup by iteration count, not by list position.
🔧 Proposed fix:
```diff
-ITERATION_REGEX = re.compile(
-    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
-    r".*?"
-    r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?)",
-    re.IGNORECASE,
-)
+ITERATION_REGEX = re.compile(
+    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
+    r"(?:.*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?))?",
+    re.IGNORECASE,
+)

 def _extract(self, log_path: Path) -> tuple[list[float], list[float]]:
     """Extract iteration times (ms) and GPU TFLOPS from the log file."""
-    iter_times_ms: list[float] = []
-    gpu_tflops: list[float] = []
+    records: list[tuple[float, float | None]] = []
     with log_path.open("r", encoding="utf-8", errors="ignore") as f:
         for line in f:
             m = ITERATION_REGEX.search(line)
             if m:
                 try:
-                    iter_times_ms.append(float(m.group(1)))
-                    gpu_tflops.append(float(m.group(2)))
+                    iter_time = float(m.group(1))
+                    tflops = float(m.group(2)) if m.group(2) is not None else None
+                    records.append((iter_time, tflops))
                 except (ValueError, TypeError):
                     logging.debug("Failed to parse iteration metrics line: %s", line.rstrip("\n"))
     # Skip the first 20 iterations for statistics (to exclude warmup)
-    if len(iter_times_ms) > 20:
-        iter_times_ms = iter_times_ms[20:]
-        gpu_tflops = gpu_tflops[20:]
-    return iter_times_ms, gpu_tflops
+    if len(records) > 20:
+        records = records[20:]
+    iter_times_ms = [t for t, _ in records]
+    gpu_tflops = [g for _, g in records if g is not None]
+    return iter_times_ms, gpu_tflops
```
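As a quick sanity check of the relaxed pattern, it can be exercised against two representative lines; the sample lines below are only an approximation of Megatron-LM's log format, not output from a real run.

```python
import re

# Relaxed pattern from the proposed fix: the TFLOP/s group is optional.
ITERATION_REGEX = re.compile(
    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
    r"(?:.*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?))?",
    re.IGNORECASE,
)

with_tflops = (
    "iteration 21/100 | elapsed time per iteration (ms): 1432.7 | "
    "throughput per GPU (TFLOP/s/GPU): 412.3 |"
)
without_tflops = "iteration 22/100 | elapsed time per iteration (ms): 1440.1 |"

m1 = ITERATION_REGEX.search(with_tflops)
m2 = ITERATION_REGEX.search(without_tflops)
assert m1 and m1.group(1) == "1432.7" and m1.group(2) == "412.3"
assert m2 and m2.group(1) == "1440.1" and m2.group(2) is None  # time-only lines still match
```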
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/megatron_run/report_generation_strategy.py`:
- Around line 139-167: The CSV currently writes zeros when gpu_tflops is empty;
change the branch that sets
tflops_avg/tflops_median/tflops_min/tflops_max/tflops_std (used when writing via
writer to report_file in report_generation_strategy.py) to instead set those
values to empty strings (or the existing METRIC_ERROR sentinel used by
get_metric) so the CSV shows missing TFLOP/s data explicitly rather than zeros;
keep the same writer.writerow call that writes the "tflops_per_gpu" row but use
the new empty/sentinel values when gpu_tflops is falsy.
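A minimal sketch of that branch, assuming the row layout shown below; the `write_tflops_row` helper and the column order are illustrative, not the PR's actual CSV schema.

```python
import csv
import statistics
import sys

def write_tflops_row(writer, gpu_tflops: list[float]) -> None:
    """Hypothetical helper: write the tflops_per_gpu row, leaving cells empty when no data."""
    if gpu_tflops:
        stats = [
            statistics.mean(gpu_tflops),
            statistics.median(gpu_tflops),
            min(gpu_tflops),
            max(gpu_tflops),
            statistics.pstdev(gpu_tflops),
        ]
    else:
        # Empty cells (or a METRIC_ERROR sentinel) make missing TFLOP/s data explicit
        # instead of reporting misleading zeros.
        stats = [""] * 5
    writer.writerow(["tflops_per_gpu", *stats])

write_tflops_row(csv.writer(sys.stdout), [])  # -> tflops_per_gpu,,,,,
```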
In
`@tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py`:
- Around line 32-83: The tests currently (megatron_run_tr and
megatron_run_tr_no_data fixtures) only provide 3 iterations and do not exercise
the warmup-skip behavior; add a new fixture (e.g., megatron_run_tr_with_warmup)
that builds a TestRun with stdout.txt containing at least 21 iteration log lines
formatted like the existing stdout_content so the first 20 iterations are
present and can be skipped, then add a corresponding test that uses this fixture
to assert the report generation logic ignores the first 20 iterations (reference
MegatronRunTestDefinition, TestRun, and the stdout.txt written in the fixtures).
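A rough sketch of such a fixture, showing only the stdout.txt generation; the MegatronRunTestDefinition/TestRun wiring is omitted because those constructors aren't reproduced here, and the log-line format is an approximation rather than a copy of the existing stdout_content.

```python
from pathlib import Path

import pytest

@pytest.fixture
def warmup_stdout_dir(tmp_path: Path) -> Path:
    """Hypothetical fixture: 25 iterations so the 20-iteration warmup skip is exercised."""
    lines = [
        f"iteration {i + 1}/25 | elapsed time per iteration (ms): {1000.0 + i} | "
        f"throughput per GPU (TFLOP/s/GPU): {100.0 + i} |"
        for i in range(25)  # first 20 should be dropped, last 5 measured
    ]
    (tmp_path / "stdout.txt").write_text("\n".join(lines) + "\n")
    return tmp_path
```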
tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py (review thread resolved)
Keeping last 10 iters for consistency. This reverts commit 0efbc49.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/megatron_run/report_generation_strategy.py`:
- Around line 103-121: Extract the magic number 10 into a named class constant
for discoverability and future tuning: add something like _TAIL_ITERATIONS = 10
on the class, then replace the hardcoded checks and slices in _extract
(iter_times_ms and gpu_tflops handling) to use that constant (e.g., if
len(iter_times_ms) > self._TAIL_ITERATIONS: iter_times_ms =
iter_times_ms[-self._TAIL_ITERATIONS:] and similarly for gpu_tflops) so the
behavior is unchanged but the window size is configurable/readable.
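A sketch of that extraction, with the class body abridged to the slicing logic; the `_trim_to_tail` helper is hypothetical — in the PR the slices live inside `_extract`.

```python
class MegatronRunReportGenerationStrategy:
    _TAIL_ITERATIONS = 10  # trailing iterations kept for statistics

    def _trim_to_tail(
        self, iter_times_ms: list[float], gpu_tflops: list[float]
    ) -> tuple[list[float], list[float]]:
        # Same behavior as the hardcoded version, but the window size is named.
        if len(iter_times_ms) > self._TAIL_ITERATIONS:
            iter_times_ms = iter_times_ms[-self._TAIL_ITERATIONS:]
            gpu_tflops = gpu_tflops[-self._TAIL_ITERATIONS:]
        return iter_times_ms, gpu_tflops
```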
- Around line 169-186: get_metric currently calls _get_extracted_data() every
time, causing repeated file I/O and parsing; modify _get_extracted_data (or add
a new helper) to cache its results so subsequent calls return the parsed tuple
without re-reading stdout.txt. You can implement caching either by decorating
_get_extracted_data with functools.lru_cache()/functools.cache (since it takes
only self) or by storing the parsed tuple on the instance (e.g.,
self._cached_extracted_data) and returning it if present; ensure get_metric
continues to call _get_extracted_data and that cache invalidation isn’t required
for immutable benchmark outputs.
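A sketch of the instance-attribute variant; only the method body is shown, and the attribute name plus the call into `_extract(self.results_file)` are assumptions based on the comment above, not code copied from the PR.

```python
def _get_extracted_data(self) -> tuple[list[float], list[float]]:
    # Parse stdout.txt once per strategy instance; benchmark output is immutable,
    # so the cache never needs invalidation.
    if not hasattr(self, "_cached_extracted_data"):
        self._cached_extracted_data = self._extract(self.results_file)
    return self._cached_extracted_data
```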
- Around line 85-91: Add a brief explanatory comment above the results_file
property clarifying that get_log_file() returns an existing Path or None while
results_file intentionally always returns the canonical Path to the expected
"stdout.txt" location (self.test_run.output_path / "stdout.txt") even if the
file doesn't exist; reference both get_log_file and results_file in the comment
so future readers understand this predictable-accessor vs existence-check
distinction.
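Something along these lines could work; the property body comes from the review text above, while the comment wording is only a suggestion.

```python
@property
def results_file(self) -> Path:
    # Unlike get_log_file(), which returns an existing Path or None, this property
    # always returns the canonical expected location of stdout.txt, whether or not
    # the file exists yet.
    return self.test_run.output_path / "stdout.txt"
```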
@juntaowww I just want to double check whether we can do this now or whether you have further updates for this one. Let me know if we can merge it as is.
@amaslenn Good for merging now. cc @srivatsankrishnan, we could iterate based on this version.
Summary
Adds a new report generation strategy for the MegatronRun workload to extract iteration time and GPU TFLOP/s metrics from training logs. This enables Design Space Exploration (DSE) capabilities for MegatronRun workloads.
Also fixes the configuration overwrite behavior for `agent_metrics` in test and scenario TOML setups. Originally, if `agent_metrics` was set in the test configuration but not in the scenario configuration, it would be overwritten by `[default,]`. Now `agent_metrics` from the test configuration is correctly propagated when it is not set in the scenario configuration.
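A toy illustration of the intended precedence; this is not CloudAI's actual merge code, and the dict shapes and `["default"]` fallback are assumptions.

```python
def resolve_agent_metrics(test_cfg: dict, scenario_cfg: dict) -> list[str]:
    # Only let the scenario override agent_metrics when it explicitly sets it;
    # previously the scenario's implicit default always won.
    if "agent_metrics" in scenario_cfg:
        return scenario_cfg["agent_metrics"]
    return test_cfg.get("agent_metrics", ["default"])

assert resolve_agent_metrics({"agent_metrics": ["iteration-time"]}, {}) == ["iteration-time"]
```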
Test Plan
Adds the following to the test configurations:
`megatron_run_report.csv` would be generated and `trajectory.csv` would reflect the observations and rewards.