
Add report generation strategy for the MegatronRun#787

Merged
amaslenn merged 9 commits into NVIDIA:main from juntaowww:megatron_run_report
Feb 12, 2026

Conversation


@juntaowww juntaowww commented Jan 22, 2026

Summary

Adds a new report generation strategy for the MegatronRun workload to extract iteration time and GPU TFLOP/s metrics from training logs. This enables Design Space Exploration (DSE) capabilities for MegatronRun workloads.

Also fixes the configuration overwrite behavior for agent_metrics in test and scenario TOML setups. Previously, if agent_metrics was set in the test configuration but not in the scenario configuration, it would be overwritten by [default,]. Now agent_metrics from the test configuration is correctly propagated when the scenario configuration does not set it.
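The serialization fix can be sketched with plain Python, emulating Pydantic's explicitly-set-field tracking (model_fields_set). The class below is a simplified hypothetical stand-in, not the real TestRunModel:

```python
# Minimal sketch of "emit agent_metrics only when explicitly set".
# _UNSET emulates Pydantic's model_fields_set bookkeeping.
_UNSET = object()  # sentinel marking "field not provided by the caller"


class TestRunModel:
    def __init__(self, agent_metrics=_UNSET):
        self._explicitly_set = agent_metrics is not _UNSET
        self.agent_metrics = ["default"] if agent_metrics is _UNSET else agent_metrics

    def tdef_model_dump(self) -> dict:
        # Emit agent_metrics only when the caller set it; None otherwise,
        # so a test-level value is not clobbered by the scenario default.
        return {"agent_metrics": self.agent_metrics if self._explicitly_set else None}
```

With this shape, a scenario-level dump that never set agent_metrics serializes None instead of the default list, letting the test-level value survive the merge.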

Test Plan

Add the following to the test configurations:

agent_metrics = ["tflops-per-gpu"]
agent_reward_function = "identity"  # For maximizing TFLOP/s

# Or
# agent_metrics = ["iteration-time"] # For minimizing iteration time

megatron_run_report.csv will be generated, and trajectory.csv will reflect the observations and rewards.


coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

Adds MegatronRunReportGenerationStrategy: parses Megatron-Run stdout to extract per-iteration times and per-GPU TFLOP/s, computes statistics over recent iterations, writes a CSV report, exports and registers the strategy, and adds unit tests.

Changes

Cohort / File(s) Summary
Report strategy implementation
src/cloudai/workloads/megatron_run/report_generation_strategy.py
New MegatronRunReportGenerationStrategy class. Parses stdout.txt for iteration-time (ms) and tflops-per-gpu, keeps recent iterations, computes avg/median/min/max/std, writes megatron_run_report.csv, exposes get_metric and helpers, and handles missing/invalid data.
Package exports
src/cloudai/workloads/megatron_run/__init__.py
Added MegatronRunReportGenerationStrategy to module exports (__all__) and updated copyright year.
Registration
src/cloudai/registration.py
Imported and registered MegatronRunReportGenerationStrategy for MegatronRunTestDefinition via Registry().add_report(...); minor copyright year bump.
Models
src/cloudai/models/scenario.py
TestRunModel.tdef_model_dump changed to emit agent_metrics only when present in the model's fields set (otherwise None).
Tests
tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py, tests/test_test_scenario.py
New unit tests and fixtures for MegatronRun strategy (stdout samples, detection, CSV structure/content, metric extraction including invalid/no-data); test scenario updated to include the new strategy and agent_metrics coverage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐇 I nibble logs with whiskers keen,
find iteration times and TFLOP/s unseen.
I skip the first hops, then average the rest,
a neat little CSV to show which run's best.
Hooray — the rabbit reports pass the test! ✨

🚥 Pre-merge checks | ✅ 2 passed
  • Title check ✅ Passed: The title accurately summarizes the main change: adding a report generation strategy for MegatronRun workloads.
  • Description check ✅ Passed: The description clearly relates to the changeset, explaining the new report generation strategy for MegatronRun and fixes to agent_metrics configuration behavior.



greptile-apps bot commented Jan 22, 2026

Greptile Overview

Greptile Summary

Adds MegatronRunReportGenerationStrategy to extract iteration time and GPU TFLOP/s metrics from MegatronRun training logs, enabling Design Space Exploration (DSE) capabilities. The implementation follows the established pattern from MegatronBridgeReportGenerationStrategy with similar structure and error handling.

Key changes:

  • New report generation strategy parses stdout.txt for iteration metrics using regex pattern matching
  • Skips first 20 iterations to exclude warmup period (vs MegatronBridge's approach of keeping last 10)
  • Generates CSV reports with statistics (avg, median, min, max, std) for both iteration time (ms) and TFLOP/s per GPU
  • Supports three metric types: default, iteration-time, and tflops-per-gpu for DSE agent integration
  • Updates scenario.py to conditionally serialize agent_metrics only when explicitly set, preventing default values from being written
  • Comprehensive test coverage with multiple edge cases and validation scenarios

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation follows well-established patterns from MegatronBridge, has comprehensive test coverage including edge cases, properly handles errors, and all changes are additive without modifying existing functionality. The scenario.py change is a minor improvement to serialization logic that correctly uses Pydantic's model_fields_set.
  • No files require special attention

Important Files Changed

Filename Overview
src/cloudai/workloads/megatron_run/report_generation_strategy.py Adds new MegatronRunReportGenerationStrategy class to extract iteration time and TFLOP/s metrics from stdout.txt, following similar patterns to MegatronBridge. Implementation looks solid with proper error handling and warmup iteration skipping.
tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py Comprehensive test coverage for the new report generation strategy including happy path, edge cases, and metric extraction validation.
src/cloudai/models/scenario.py Updates tdef_model_dump to conditionally serialize agent_metrics only if explicitly set, preventing default values from being serialized.

Sequence Diagram

sequenceDiagram
    participant User
    participant System
    participant Registry
    participant MegatronRunRGS as MegatronRunReportGenerationStrategy
    participant TestRun
    participant LogFile as stdout.txt
    participant ReportFile as megatron_run_report.csv

    User->>System: Execute MegatronRun workload
    System->>TestRun: Create TestRun instance
    TestRun->>LogFile: Generate stdout.txt with iteration metrics
    
    User->>System: Request report generation
    System->>Registry: Get report strategies for MegatronRunTestDefinition
    Registry-->>System: Return [CheckpointTimingRGS, MegatronRunRGS]
    
    System->>MegatronRunRGS: can_handle_directory()
    MegatronRunRGS->>LogFile: Open and search for ITERATION_REGEX
    LogFile-->>MegatronRunRGS: Pattern found/not found
    MegatronRunRGS-->>System: Return true/false
    
    alt Can handle directory
        System->>MegatronRunRGS: generate_report()
        MegatronRunRGS->>LogFile: Extract iteration times and TFLOP/s
        LogFile-->>MegatronRunRGS: Raw metrics data
        MegatronRunRGS->>MegatronRunRGS: Skip first 20 iterations (warmup)
        MegatronRunRGS->>MegatronRunRGS: Calculate statistics (mean, median, min, max, pstdev)
        MegatronRunRGS->>ReportFile: Write CSV with metrics
        ReportFile-->>MegatronRunRGS: Success
        
        User->>System: Request metric value
        System->>MegatronRunRGS: get_metric("iteration-time" or "tflops-per-gpu")
        MegatronRunRGS->>LogFile: Re-extract data
        MegatronRunRGS->>MegatronRunRGS: Calculate mean
        MegatronRunRGS-->>System: Return metric value
        System-->>User: Metric value for DSE
    else Cannot handle directory
        MegatronRunRGS-->>System: Cannot handle
        System->>Registry: Try next strategy
    end

@amaslenn amaslenn left a comment
We are in the code freeze stage now, this PR will be merged later.

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/workloads/megatron_run/report_generation_strategy.py (1)

30-121: Iteration-time parsing fails when TFLOP/s isn’t logged.

ITERATION_REGEX requires TFLOP/s, so logs that only print iteration time will be ignored; can_handle_directory() then returns False and get_metric("iteration-time") returns METRIC_ERROR despite valid data. Consider making TFLOP/s optional and skipping warmup by iteration count, not by list position.

🔧 Proposed fix
-ITERATION_REGEX = re.compile(
-    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
-    r".*?"
-    r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?)",
-    re.IGNORECASE,
-)
+ITERATION_REGEX = re.compile(
+    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
+    r"(?:.*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?))?",
+    re.IGNORECASE,
+)

 def _extract(self, log_path: Path) -> tuple[list[float], list[float]]:
     """Extract iteration times (ms) and GPU TFLOPS from the log file."""
-    iter_times_ms: list[float] = []
-    gpu_tflops: list[float] = []
+    records: list[tuple[float, float | None]] = []
     with log_path.open("r", encoding="utf-8", errors="ignore") as f:
         for line in f:
             m = ITERATION_REGEX.search(line)
             if m:
                 try:
-                    iter_times_ms.append(float(m.group(1)))
-                    gpu_tflops.append(float(m.group(2)))
+                    iter_time = float(m.group(1))
+                    tflops = float(m.group(2)) if m.group(2) is not None else None
+                    records.append((iter_time, tflops))
                 except (ValueError, TypeError):
                     logging.debug("Failed to parse iteration metrics line: %s", line.rstrip("\n"))

     # Skip the first 20 iterations for statistics (to exclude warmup)
-    if len(iter_times_ms) > 20:
-        iter_times_ms = iter_times_ms[20:]
-        gpu_tflops = gpu_tflops[20:]
-    return iter_times_ms, gpu_tflops
+    if len(records) > 20:
+        records = records[20:]
+    iter_times_ms = [t for t, _ in records]
+    gpu_tflops = [g for _, g in records if g is not None]
+    return iter_times_ms, gpu_tflops
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/megatron_run/report_generation_strategy.py`:
- Around line 139-167: The CSV currently writes zeros when gpu_tflops is empty;
change the branch that sets
tflops_avg/tflops_median/tflops_min/tflops_max/tflops_std (used when writing via
writer to report_file in report_generation_strategy.py) to instead set those
values to empty strings (or the existing METRIC_ERROR sentinel used by
get_metric) so the CSV shows missing TFLOP/s data explicitly rather than zeros;
keep the same writer.writerow call that writes the "tflops_per_gpu" row but use
the new empty/sentinel values when gpu_tflops is falsy.

In
`@tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py`:
- Around line 32-83: The tests currently (megatron_run_tr and
megatron_run_tr_no_data fixtures) only provide 3 iterations and do not exercise
the warmup-skip behavior; add a new fixture (e.g., megatron_run_tr_with_warmup)
that builds a TestRun with stdout.txt containing at least 21 iteration log lines
formatted like the existing stdout_content so the first 20 iterations are
present and can be skipped, then add a corresponding test that uses this fixture
to assert the report generation logic ignores the first 20 iterations (reference
MegatronRunTestDefinition, TestRun, and the stdout.txt written in the fixtures).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/megatron_run/report_generation_strategy.py`:
- Around line 103-121: Extract the magic number 10 into a named class constant
for discoverability and future tuning: add something like _TAIL_ITERATIONS = 10
on the class, then replace the hardcoded checks and slices in _extract
(iter_times_ms and gpu_tflops handling) to use that constant (e.g., if
len(iter_times_ms) > self._TAIL_ITERATIONS: iter_times_ms =
iter_times_ms[-self._TAIL_ITERATIONS:] and similarly for gpu_tflops) so the
behavior is unchanged but the window size is configurable/readable.
- Around line 169-186: get_metric currently calls _get_extracted_data() every
time, causing repeated file I/O and parsing; modify _get_extracted_data (or add
a new helper) to cache its results so subsequent calls return the parsed tuple
without re-reading stdout.txt. You can implement caching either by decorating
_get_extracted_data with functools.lru_cache()/functools.cache (since it takes
only self) or by storing the parsed tuple on the instance (e.g.,
self._cached_extracted_data) and returning it if present; ensure get_metric
continues to call _get_extracted_data and that cache invalidation isn’t required
for immutable benchmark outputs.
- Around line 85-91: Add a brief explanatory comment above the results_file
property clarifying that get_log_file() returns an existing Path or None while
results_file intentionally always returns the canonical Path to the expected
"stdout.txt" location (self.test_run.output_path / "stdout.txt") even if the
file doesn't exist; reference both get_log_file and results_file in the comment
so future readers understand this predictable-accessor vs existence-check
distinction.
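The instance-attribute variant of the caching suggestion might look like the following; the class is a simplified hypothetical stand-in for the strategy, with file parsing abstracted into a callable:

```python
class MegatronRunRGSSketch:
    """Hypothetical stand-in showing cached _get_extracted_data()."""

    def __init__(self, read_fn):
        self._read_fn = read_fn  # stands in for parsing stdout.txt
        self._cached_extracted_data = None

    def _get_extracted_data(self):
        # Benchmark output is immutable once written, so no invalidation
        # is needed; parse once and reuse the tuple on later calls.
        if self._cached_extracted_data is None:
            self._cached_extracted_data = self._read_fn()
        return self._cached_extracted_data
```

functools.cache on a zero-argument method would achieve the same effect, at the cost of keeping the instance alive in the cache.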

@amaslenn
Contributor

@juntaowww I just want to double check if we can do it now or you have some updates for this one. Let me know if we can merge it as is.

@juntaowww
Contributor Author

> @juntaowww I just want to double check if we can do it now or you have some updates for this one. Let me know if we can merge it as is.

@amaslenn good for merging now.

cc @srivatsankrishnan we could iterate based on this version.

@amaslenn amaslenn merged commit 3045293 into NVIDIA:main Feb 12, 2026
4 checks passed