Description
GitHub Issue: ACI Pipeline Agent PAT Token Loss - Investigation and Mitigation
Repository
Azure/alz-terraform-accelerator
Issue Title
[Bug]: Azure Container Instance pipeline agents lose AZP_TOKEN secureValue, causing CrashLoopBackOff
Issue Body
Describe the bug
Azure Container Instance (ACI) self-hosted Azure DevOps pipeline agents deployed via the ALZ Terraform Accelerator bootstrap can unexpectedly lose their AZP_TOKEN secure environment variable, causing the containers to enter CrashLoopBackOff state and pipeline jobs to remain stuck in queue.
Versions
- ALZ Terraform Accelerator version: v4.8.0
- AzureRM Provider version: (from bootstrap)
- Terraform version: (from bootstrap)
- Azure region: Germany West Central
Steps to reproduce
- Deploy ALZ Terraform Accelerator bootstrap with Azure DevOps and ACI agents
- Verify agents are working (containers running, "Listening for Jobs")
- Wait for an unknown period (weeks/months)
- Observe containers entering CrashLoopBackOff with error:
1. AZP_TOKEN must be set
Investigation Findings
Symptoms
- Container logs show:
1. AZP_TOKEN must be set
- Azure DevOps shows:
Job pending. Waiting at position 1 in queue.
- Container is in CrashLoopBackOff with high restart count
Root Cause Analysis
When querying the container environment variables via Azure CLI:
{
"name": "AZP_TOKEN",
"secureValue": null,
"value": null
}
Both secureValue AND value are null. This is NOT the normal case where secureValue is hidden: the token is genuinely missing.
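The exact query behind this output is not given in the report. A minimal sketch of one way to pull that single entry with `az container show` follows; the function name is hypothetical, and it assumes the agent is the first container in the group (`containers[0]`).

```shell
# Print the AZP_TOKEN entry of a container group's first container as JSON.
# Assumption: the agent is containers[0]; adjust the index for multi-container groups.
show_azp_token_entry() {
  name="$1"   # container group name
  rg="$2"     # resource group
  az container show \
    --name "$name" \
    --resource-group "$rg" \
    --query "containers[0].environmentVariables[?name=='AZP_TOKEN'] | [0]" \
    --output json
}

# Usage (placeholders as in the rest of this issue):
#   show_azp_token_entry <container> <rg>
```

Because the API is described later in this report as returning null for secure values by design, it may help to run the same query against a known-healthy agent for comparison.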
Why Terraform Doesn't Detect This
Azure API never returns secure values in responses. The Terraform provider cannot detect if a secure_environment_variable has been cleared on the Azure side because:
- Terraform stores the value in state (encrypted)
- Azure returns null for secure values (always, by design)
- Terraform sees null and assumes it matches (can't compare)
- No drift is detected, terraform plan shows no changes
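To see this from a pipeline's point of view, here is a minimal sketch of a plan-based drift check. It relies on Terraform's `-detailed-exitcode` flag (0 = no changes, 1 = error, 2 = changes present); the function name is hypothetical. For the failure described here, such a check would be expected to print `no-drift` even though the token is gone.

```shell
# Classify the result of `terraform plan` using -detailed-exitcode.
# Exit codes: 0 = empty diff, 1 = error, 2 = non-empty diff.
plan_drift_status() {
  rc=0
  terraform plan -detailed-exitcode -input=false -no-color >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "no-drift" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Usage, from the bootstrap's Terraform working directory:
#   plan_drift_status
```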
Suspected Root Cause: Azure Host Rehosting
We strongly suspect this issue is caused by Azure rehosting the container to a different underlying host. Microsoft documentation confirms this can happen:
"customers may experience restarts initiated by the ACI infrastructure due to maintenance events"
"Although rare, there are some Azure-internal events that can cause redeployment to a different host."
When Azure moves a container group to a new host (due to maintenance, hardware failure, or capacity balancing), the secure environment variables may not be properly preserved during the migration.
Why This Is Difficult to Replicate
This issue is extremely difficult to reproduce because:
- Rehosting is an Azure-internal operation - Users cannot trigger it manually
- It happens rarely and unpredictably - Could take weeks or months
- No visibility - There's no Azure API to check which host a container is running on
- Activity logs expire - 90-day retention means evidence is lost before discovery
- Normal restarts work fine - Only rehosting to a different host causes the issue
We explicitly tested normal restarts (az container restart) and confirmed the PAT token was preserved. The issue only manifests when Azure moves the container to a different host.
Verified: Normal Restarts Preserve PAT
We tested this explicitly:
az container restart --name <container> --resource-group <rg>
After restart, the container successfully reconnected with the PAT intact. Normal restarts do NOT cause this issue.
Suggested Mitigations
Option 1: Document the Limitation
Add documentation warning users that:
- Terraform cannot detect secure environment variable drift
- Users should monitor for CrashLoopBackOff
- Re-running terraform apply with the PAT variable will NOT fix the issue (needs explicit recreation)
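One way to perform that explicit recreation from Terraform is the `-replace` planning option (available in Terraform 0.15.2 and later), which forces a resource to be destroyed and recreated even when the plan shows no changes. A minimal sketch follows; the function name is hypothetical, and the resource address is not given in this issue, so it has to be looked up with `terraform state list`.

```shell
# Force recreation of one container group so the secure variable is sent again.
# $1: resource address as printed by `terraform state list` (not given in this issue).
recreate_agent() {
  address="$1"
  if [ -z "$address" ]; then
    echo "usage: recreate_agent <resource-address>" >&2
    return 2
  fi
  terraform apply -replace="$address"
}

# Usage:
#   terraform state list          # find the container group's address
#   recreate_agent '<address>'    # quote it: addresses may contain [ ] characters
```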
Option 2: Use Azure Key Vault
Modify the bootstrap to:
- Store the PAT in Azure Key Vault
- Have the container retrieve the PAT at startup via managed identity
- This way, even if the container is recreated, it can always fetch the current secret
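A minimal sketch of what such a startup step could look like, assuming the agent image ships the Azure CLI and the container group has a managed identity that is allowed to read the secret. The function name, vault name, and secret name are placeholders and not part of the current bootstrap.

```shell
# Fetch the PAT from Key Vault with the container group's managed identity
# and export it for the agent's start script.
# Assumptions: Azure CLI is installed in the image; $1 = vault name, $2 = secret name.
load_pat_from_key_vault() {
  vault="$1"
  secret="$2"
  az login --identity --output none || return 1
  token=$(az keyvault secret show \
    --vault-name "$vault" \
    --name "$secret" \
    --query value \
    --output tsv) || return 1
  # Refuse to continue with an empty token instead of crash-looping later.
  [ -n "$token" ] || return 1
  AZP_TOKEN="$token"
  export AZP_TOKEN
}

# Usage, early in the container entrypoint:
#   load_pat_from_key_vault <vault-name> <secret-name> || exit 1
```

For a user-assigned identity, `az login --identity` additionally needs to be told which identity to use.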
Option 3: Add Monitoring/Alerting
Include Azure Monitor alerts for:
- Container restart count > threshold
- Container state = "Waiting" or "CrashLoopBackOff"
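Until such alerts exist, the same two conditions can be checked from a scheduled job. The sketch below reads the restart count and current state from the container's instance view via `az container show`. The function names and the default threshold of 5 are illustrative choices, and the agent is assumed to be the first container in the group.

```shell
# Decide whether an agent looks unhealthy from its restart count and state.
# $1: restart count, $2: current state, $3: restart threshold (default 5, arbitrary).
agent_unhealthy() {
  restarts="$1"
  state="$2"
  threshold="${3:-5}"
  # Treat a missing or non-numeric count as 0.
  case "$restarts" in ''|*[!0-9]*) restarts=0 ;; esac
  case "$state" in Waiting|CrashLoopBackOff) return 0 ;; esac
  [ "$restarts" -gt "$threshold" ]
}

# Query the instance view of containers[0] and print a one-line verdict.
check_agent() {
  name="$1"
  rg="$2"
  view=$(az container show --name "$name" --resource-group "$rg" \
    --query "containers[0].instanceView.[restartCount, currentState.state]" \
    --output tsv) || return 1
  # The two values arrive whitespace-separated; split them into $1 and $2.
  set -- $view
  if agent_unhealthy "$1" "$2"; then
    echo "UNHEALTHY restarts=$1 state=$2"
  else
    echo "OK restarts=$1 state=$2"
  fi
}

# Usage:
#   check_agent <container> <rg>
```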
Option 4: Lifecycle Ignore + External Management
Use lifecycle { ignore_changes = [container[0].secure_environment_variables] } (the block in azurerm_container_group is named container) and manage the secret externally.
Workaround
To fix affected containers, you must delete and recreate them with the PAT:
# Delete existing containers
az container delete --name <container-name> --resource-group <rg> --yes
# Recreate with PAT
az container create --name <container-name> \
--resource-group <rg> \
--image <image> \
--secure-environment-variables AZP_TOKEN=<pat> \
--environment-variables AZP_URL=<url> AZP_POOL=<pool> AZP_AGENT_NAME=<name> \
# ... other parameters
Or use a Bicep/ARM template that explicitly sets the secureValue.
Verification Command
To check if the PAT is working (from inside the container):
az container exec --name <container> --resource-group <rg> --exec-command "printenv AZP_TOKEN"
If this returns empty or fails, the PAT is missing.
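Because `printenv` prints the secret to the terminal, a log-based check can be a useful complement. The sketch below looks for the two messages quoted earlier in this report (`Listening for Jobs` and `AZP_TOKEN must be set`) in the output of `az container logs`; the function names are hypothetical.

```shell
# Classify agent log text read from stdin.
classify_agent_logs() {
  logs=$(cat)
  case "$logs" in
    *"AZP_TOKEN must be set"*) echo "token-missing" ;;
    *"Listening for Jobs"*)    echo "listening" ;;
    *)                         echo "unknown" ;;
  esac
}

# Fetch the logs of a container group and classify them.
agent_log_status() {
  az container logs --name "$1" --resource-group "$2" | classify_agent_logs
}

# Usage:
#   agent_log_status <container> <rg>
```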
Additional Context
- The containers were originally deployed in August 2025
- The issue was discovered in December 2025 (4+ months later)
- Activity logs only retain 90 days, so we cannot see what Azure operations occurred
- Both containers in different availability zones were affected simultaneously
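Because the 90-day window removed the evidence in this case, periodically archiving the activity log for the agents' resource group could help the next investigation. A minimal sketch, assuming a scheduled job with read access to the subscription; the function name, output path, and the event cap of 1000 are placeholders.

```shell
# Save the resource group's activity log for the last N days to a JSON file.
# $1: resource group, $2: output file, $3: look-back in days (default 90).
archive_activity_log() {
  rg="$1"
  out="$2"
  days="${3:-90}"
  # --max-events is raised from its small default; 1000 is an arbitrary choice.
  az monitor activity-log list \
    --resource-group "$rg" \
    --offset "${days}d" \
    --max-events 1000 \
    --output json > "$out"
}

# Usage:
#   archive_activity_log <rg> "activity-$(date +%Y%m%d).json"
```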
References
- ACI Troubleshooting - Isolated Restarts
- ACI Update Limitations
- Terraform AzureRM Issue #8096 - Secure environment variables
How to Create the Issue
- Go to: https://github.com/Azure/alz-terraform-accelerator/issues/new
- Copy the content above (from "### Describe the bug" onwards)
- Use the title: [Bug]: Azure Container Instance pipeline agents lose AZP_TOKEN secureValue, causing CrashLoopBackOff
- Add labels: bug, documentation (if available)
- Submit