@scotwells

Summary

This PR reorganizes metrics configuration and adds comprehensive readiness monitoring for Resource Manager, IAM, and Notification API groups.

Changes

Metrics reorganization

  • Move metrics from the flat config/resources-metrics/ directory to the hierarchical config/services/{api}/telemetry/metrics/control-plane/ structure
  • Add complete Notification API metrics (templates, contacts, groups, broadcasts, emails)
  • Add missing IAM metrics (user_deactivations, user_preferences)
  • Enhance all metrics with status conditions, generation tracking, common labels, and help text
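To make the enhancement bullet concrete, here is a minimal sketch of a status-condition metric with common labels, generation tracking, and help text. It assumes the metrics are defined as kube-state-metrics custom resource state configuration, which this PR description does not confirm. The API group, kind, metric prefix, and label names are hypothetical placeholders.

```yaml
# Hypothetical sketch: assumes kube-state-metrics CustomResourceStateMetrics.
# Group, version, kind, prefix, and label names are placeholders, not taken from this PR.
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: example.com            # placeholder API group
        version: v1alpha1
        kind: Widget                  # placeholder resource kind
      metricNamePrefix: example_widget
      # Common labels attached to every metric for this resource.
      labelsFromPath:
        name: [metadata, name]
        namespace: [metadata, namespace]
      metrics:
        # One series per entry in status.conditions, labelled by condition type.
        - name: status_condition
          help: "Status conditions reported by the resource"
          each:
            type: Gauge
            gauge:
              path: [status, conditions]
              labelsFromPath:
                condition: [type]
              valueFrom: [status]
        # Generation tracking: exposes metadata.generation as a gauge.
        - name: generation
          help: "metadata.generation of the resource"
          each:
            type: Gauge
            gauge:
              path: [metadata, generation]
```

With the placeholder prefix above, the resulting series would be named example_widget_status_condition and example_widget_generation.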

Recording rules

  • Add rules tracking resources not in Ready state for Resource Manager, IAM, and Notification APIs
  • Rules evaluate every 15s to detect readiness issues quickly
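A recording rule of this kind could look roughly like the following sketch, written as a plain Prometheus rule file. Only the 15s interval comes from this PR. The group name, recorded series name, source metric, and condition label are hypothetical, and the actual rules may be packaged differently.

```yaml
# Hypothetical sketch of a Prometheus rule file; all names are placeholders.
groups:
  - name: example-readiness.recording
    interval: 15s                     # evaluation interval stated in this PR
    rules:
      # Number of resources whose Ready condition series currently has value 0.
      # "or vector(0)" keeps the recorded series present when nothing matches.
      - record: example:widgets_not_ready:count
        expr: |
          count(example_widget_status_condition{condition="Ready"} == 0) or vector(0)
```

Precomputing the count keeps alert expressions short and lets dashboards query a single cheap series.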

Alerting rules

  • Add readiness alerts for Resource Manager, IAM, and Notification resources
  • Refactor quota alerts to focus on policy and resource readiness
  • Critical alerts fire after 1-2 minutes, warnings after 5-10 minutes, to allow reconciliation time
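The timing described above maps onto the `for` field of a Prometheus alerting rule. The sketch below is illustrative only: the alert names, series names, and annotations are hypothetical, and the PR description does not say which resources get which severity or the exact durations within the stated ranges.

```yaml
# Hypothetical sketch of Prometheus alerting rules; all names are placeholders.
groups:
  - name: example-readiness.alerts
    rules:
      # Critical: condition must hold for 2m before firing (PR states 1-2 minutes).
      - alert: ExampleWidgetsNotReady
        expr: example:widgets_not_ready:count > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "One or more widgets have not been Ready for 2 minutes"
      # Warning: longer hold time (PR states 5-10 minutes) to give controllers
      # time to reconcile before anyone is notified.
      - alert: ExampleGadgetsNotReady
        expr: example:gadgets_not_ready:count > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "One or more gadgets have not been Ready for 10 minutes"
```

An alert stays in the pending state until its expression has been true for the full `for` duration, so short-lived unreadiness during normal reconciliation does not fire.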

Structure

The new structure aligns with the quota system pattern:

config/services/{api}/telemetry/
├── kustomization.yaml
└── metrics/
    └── control-plane/
        ├── kustomization.yaml
        └── {resource}.yaml
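For orientation, the two kustomization.yaml files in this tree might be wired together as in the sketch below. The file contents are an assumption based on standard kustomize usage, and widget.yaml is a hypothetical stand-in for a {resource}.yaml file.

```yaml
# Sketch of config/services/{api}/telemetry/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - metrics/control-plane             # pulls in the nested kustomization below
---
# Sketch of config/services/{api}/telemetry/metrics/control-plane/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - widget.yaml                       # placeholder for a {resource}.yaml file
```

The two documents are shown in one block for brevity; on disk they would be separate files at the paths named in the comments.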

Reorganize metrics from config/resources-metrics/ to follow the
config/services/{api}/telemetry/ structure established by the quota system.
Add recording rules and alerts focused exclusively on resource readiness.

Metrics changes:
- Move Resource Manager and IAM metrics to hierarchical service structure
- Add complete Notification API metrics coverage
- Enhance all metrics with status conditions and generation tracking
- Add user_deactivations and user_preferences to IAM

Recording rules:
- Add rules tracking resources not in Ready state for Resource Manager,
  IAM, and Notification APIs
- Rules evaluate every 15s to detect readiness issues quickly

Alerts:
- Add readiness alerts for Resource Manager, IAM, and Notification
- Refactor quota alerts to focus on policy and resource readiness
- Critical alerts fire after 1-2min, warnings after 5-10min to allow
  reconciliation time

joggrbot bot commented Oct 24, 2025

📝 Documentation Analysis

All docs are up to date! 🎉


✅ Latest commit analyzed: 7ec45d9 | Powered by Joggr

@scotwells force-pushed the feature/resource-metrics-and-alerting branch from 302c4c6 to 7ec45d9 on November 10, 2025 23:41