Add report generation strategy for the MegatronRun #787
Conversation
📝 Walkthrough: Adds MegatronRunReportGenerationStrategy: parses Megatron-Run stdout to extract per-iteration times and per-GPU TFLOP/s, computes statistics over recent iterations, writes a CSV report, exports and registers the strategy, and adds unit tests.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ 2 passed
Greptile Summary: Adds MegatronRunReportGenerationStrategy for the MegatronRun workload.
Confidence Score: 5/5
amaslenn left a comment:
We are in the code freeze stage now, this PR will be merged later.
src/cloudai/workloads/megatron_run/report_generation_strategy.py (outdated review thread, resolved)
Actionable comments posted: 2
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/workloads/megatron_run/report_generation_strategy.py (1)
30-121: Iteration-time parsing fails when TFLOP/s isn’t logged.
`ITERATION_REGEX` requires TFLOP/s, so logs that only print iteration time will be ignored; `can_handle_directory()` then returns False and `get_metric("iteration-time")` returns METRIC_ERROR despite valid data. Consider making TFLOP/s optional and skipping warmup by iteration count, not by list position.
🔧 Proposed fix:
```diff
-ITERATION_REGEX = re.compile(
-    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
-    r".*?"
-    r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?)",
-    re.IGNORECASE,
-)
+ITERATION_REGEX = re.compile(
+    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
+    r"(?:.*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?))?",
+    re.IGNORECASE,
+)

 def _extract(self, log_path: Path) -> tuple[list[float], list[float]]:
     """Extract iteration times (ms) and GPU TFLOPS from the log file."""
-    iter_times_ms: list[float] = []
-    gpu_tflops: list[float] = []
+    records: list[tuple[float, float | None]] = []
     with log_path.open("r", encoding="utf-8", errors="ignore") as f:
         for line in f:
             m = ITERATION_REGEX.search(line)
             if m:
                 try:
-                    iter_times_ms.append(float(m.group(1)))
-                    gpu_tflops.append(float(m.group(2)))
+                    iter_time = float(m.group(1))
+                    tflops = float(m.group(2)) if m.group(2) is not None else None
+                    records.append((iter_time, tflops))
                 except (ValueError, TypeError):
                     logging.debug("Failed to parse iteration metrics line: %s", line.rstrip("\n"))
     # Skip the first 20 iterations for statistics (to exclude warmup)
-    if len(iter_times_ms) > 20:
-        iter_times_ms = iter_times_ms[20:]
-        gpu_tflops = gpu_tflops[20:]
-    return iter_times_ms, gpu_tflops
+    if len(records) > 20:
+        records = records[20:]
+    iter_times_ms = [t for t, _ in records]
+    gpu_tflops = [g for _, g in records if g is not None]
+    return iter_times_ms, gpu_tflops
```
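As a quick sanity check of the relaxed pattern, it can be exercised against two representative lines; the sample lines below are only an approximation of Megatron-LM's log format, not output from a real run.

```python
import re

# Relaxed pattern from the proposed fix: the TFLOP/s group is optional.
ITERATION_REGEX = re.compile(
    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
    r"(?:.*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?))?",
    re.IGNORECASE,
)

with_tflops = (
    "iteration 21/100 | elapsed time per iteration (ms): 1432.7 | "
    "throughput per GPU (TFLOP/s/GPU): 412.3 |"
)
without_tflops = "iteration 22/100 | elapsed time per iteration (ms): 1440.1 |"

m1 = ITERATION_REGEX.search(with_tflops)
m2 = ITERATION_REGEX.search(without_tflops)
assert m1 and m1.group(1) == "1432.7" and m1.group(2) == "412.3"
assert m2 and m2.group(1) == "1440.1" and m2.group(2) is None  # time-only lines still match
```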
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/megatron_run/report_generation_strategy.py`:
- Around line 139-167: The CSV currently writes zeros when gpu_tflops is empty;
change the branch that sets
tflops_avg/tflops_median/tflops_min/tflops_max/tflops_std (used when writing via
writer to report_file in report_generation_strategy.py) to instead set those
values to empty strings (or the existing METRIC_ERROR sentinel used by
get_metric) so the CSV shows missing TFLOP/s data explicitly rather than zeros;
keep the same writer.writerow call that writes the "tflops_per_gpu" row but use
the new empty/sentinel values when gpu_tflops is falsy.
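A minimal sketch of that branch, assuming the row layout shown below; the `write_tflops_row` helper and the column order are illustrative, not the PR's actual CSV schema.

```python
import csv
import statistics
import sys

def write_tflops_row(writer, gpu_tflops: list[float]) -> None:
    """Hypothetical helper: write the tflops_per_gpu row, leaving cells empty when no data."""
    if gpu_tflops:
        stats = [
            statistics.mean(gpu_tflops),
            statistics.median(gpu_tflops),
            min(gpu_tflops),
            max(gpu_tflops),
            statistics.pstdev(gpu_tflops),
        ]
    else:
        # Empty cells (or a METRIC_ERROR sentinel) make missing TFLOP/s data explicit
        # instead of reporting misleading zeros.
        stats = [""] * 5
    writer.writerow(["tflops_per_gpu", *stats])

write_tflops_row(csv.writer(sys.stdout), [])  # -> tflops_per_gpu,,,,,
```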
In
`@tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py`:
- Around line 32-83: The tests currently (megatron_run_tr and
megatron_run_tr_no_data fixtures) only provide 3 iterations and do not exercise
the warmup-skip behavior; add a new fixture (e.g., megatron_run_tr_with_warmup)
that builds a TestRun with stdout.txt containing at least 21 iteration log lines
formatted like the existing stdout_content so the first 20 iterations are
present and can be skipped, then add a corresponding test that uses this fixture
to assert the report generation logic ignores the first 20 iterations (reference
MegatronRunTestDefinition, TestRun, and the stdout.txt written in the fixtures).
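A rough sketch of such a fixture, showing only the stdout.txt generation; the MegatronRunTestDefinition/TestRun wiring is omitted because those constructors aren't reproduced here, and the log-line format is an approximation rather than a copy of the existing stdout_content.

```python
from pathlib import Path

import pytest

@pytest.fixture
def warmup_stdout_dir(tmp_path: Path) -> Path:
    """Hypothetical fixture: 25 iterations so the 20-iteration warmup skip is exercised."""
    lines = [
        f"iteration {i + 1}/25 | elapsed time per iteration (ms): {1000.0 + i} | "
        f"throughput per GPU (TFLOP/s/GPU): {100.0 + i} |"
        for i in range(25)  # first 20 should be dropped, last 5 measured
    ]
    (tmp_path / "stdout.txt").write_text("\n".join(lines) + "\n")
    return tmp_path
```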
tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py (review thread resolved)
Keeping last 10 iters for consistency. This reverts commit 0efbc49.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/megatron_run/report_generation_strategy.py`:
- Around line 103-121: Extract the magic number 10 into a named class constant
for discoverability and future tuning: add something like _TAIL_ITERATIONS = 10
on the class, then replace the hardcoded checks and slices in _extract
(iter_times_ms and gpu_tflops handling) to use that constant (e.g., if
len(iter_times_ms) > self._TAIL_ITERATIONS: iter_times_ms =
iter_times_ms[-self._TAIL_ITERATIONS:] and similarly for gpu_tflops) so the
behavior is unchanged but the window size is configurable/readable.
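A sketch of that extraction, with the class body abridged to the slicing logic; the `_trim_to_tail` helper is hypothetical — in the PR the slices live inside `_extract`.

```python
class MegatronRunReportGenerationStrategy:
    _TAIL_ITERATIONS = 10  # trailing iterations kept for statistics

    def _trim_to_tail(
        self, iter_times_ms: list[float], gpu_tflops: list[float]
    ) -> tuple[list[float], list[float]]:
        # Same behavior as the hardcoded version, but the window size is named.
        if len(iter_times_ms) > self._TAIL_ITERATIONS:
            iter_times_ms = iter_times_ms[-self._TAIL_ITERATIONS:]
            gpu_tflops = gpu_tflops[-self._TAIL_ITERATIONS:]
        return iter_times_ms, gpu_tflops
```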
- Around line 169-186: get_metric currently calls _get_extracted_data() every
time, causing repeated file I/O and parsing; modify _get_extracted_data (or add
a new helper) to cache its results so subsequent calls return the parsed tuple
without re-reading stdout.txt. You can implement caching either by decorating
_get_extracted_data with functools.lru_cache()/functools.cache (since it takes
only self) or by storing the parsed tuple on the instance (e.g.,
self._cached_extracted_data) and returning it if present; ensure get_metric
continues to call _get_extracted_data and that cache invalidation isn’t required
for immutable benchmark outputs.
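A sketch of the instance-attribute variant; only the method body is shown, and the attribute name plus the call into `_extract(self.results_file)` are assumptions based on the comment above, not code copied from the PR.

```python
def _get_extracted_data(self) -> tuple[list[float], list[float]]:
    # Parse stdout.txt once per strategy instance; benchmark output is immutable,
    # so the cache never needs invalidation.
    if not hasattr(self, "_cached_extracted_data"):
        self._cached_extracted_data = self._extract(self.results_file)
    return self._cached_extracted_data
```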
- Around line 85-91: Add a brief explanatory comment above the results_file
property clarifying that get_log_file() returns an existing Path or None while
results_file intentionally always returns the canonical Path to the expected
"stdout.txt" location (self.test_run.output_path / "stdout.txt") even if the
file doesn't exist; reference both get_log_file and results_file in the comment
so future readers understand this predictable-accessor vs existence-check
distinction.
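Something along these lines could work; the property body comes from the review text above, while the comment wording is only a suggestion.

```python
@property
def results_file(self) -> Path:
    # Unlike get_log_file(), which returns an existing Path or None, this property
    # always returns the canonical expected location of stdout.txt, whether or not
    # the file exists yet.
    return self.test_run.output_path / "stdout.txt"
```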
@juntaowww I just want to double check whether we can do this now or whether you have further updates for this one. Let me know if we can merge it as is.
@amaslenn Good for merging now. cc @srivatsankrishnan, we could iterate based on this version.
Summary
Adds a new report generation strategy for the MegatronRun workload to extract iteration time and GPU TFLOP/s metrics from training logs. This enables Design Space Exploration (DSE) capabilities for MegatronRun workloads.
Also fixes the configuration overwrite behavior for `agent_metrics` in test and scenario TOML setups. Originally, if `agent_metrics` was set in the test configuration but not in the scenario configuration, it would be overwritten by `[default,]`. Now `agent_metrics` from the test configuration is correctly propagated when it is not set in the scenario configuration.
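A toy illustration of the intended precedence; this is not CloudAI's actual merge code, and the dict shapes and `["default"]` fallback are assumptions.

```python
def resolve_agent_metrics(test_cfg: dict, scenario_cfg: dict) -> list[str]:
    # Only let the scenario override agent_metrics when it explicitly sets it;
    # previously the scenario's implicit default always won.
    if "agent_metrics" in scenario_cfg:
        return scenario_cfg["agent_metrics"]
    return test_cfg.get("agent_metrics", ["default"])

assert resolve_agent_metrics({"agent_metrics": ["iteration-time"]}, {}) == ["iteration-time"]
```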
Test Plan
Adds the following to the test configurations:
`megatron_run_report.csv` would be generated and `trajectory.csv` would reflect the observations and rewards.