add blog on scale ray on AKS #5601
base: master
Pull request overview
Adds a new AKS blog post that describes how to run/scale Ray workloads on AKS with Anyscale, covering multi-region capacity, unified storage (BlobFuse2), and service-principal-based authentication. It also includes diagram assets (SVGs) and their Mermaid sources to illustrate the storage and authentication flows.
Changes:
- Add new blog post content for “Scaling Ray on AKS” with examples and architecture guidance.
- Add Mermaid source diagrams for storage and authentication flows, plus exported SVG versions.
- Add supporting screenshots used by the post.
Reviewed changes
Copilot reviewed 3 out of 7 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| website/blog/2026-02-13-scaling-ray-aks/index.md | New blog post describing scaling Ray on AKS with multi-region, storage, and auth guidance |
| website/blog/2026-02-13-scaling-ray-aks/cluster-storage.mmd | Mermaid source for the cluster storage architecture diagram |
| website/blog/2026-02-13-scaling-ray-aks/cluster-storage.svg | Exported SVG for the cluster storage architecture diagram |
| website/blog/2026-02-13-scaling-ray-aks/auth-flow.mmd | Mermaid source for the service principal authentication flow diagram |
| website/blog/2026-02-13-scaling-ray-aks/auth-flow.svg | Exported SVG for the authentication flow diagram |
| description: "Learn how to run production-grade Ray workloads on Azure Kubernetes Service with multi-region support, unified storage, and secure authentication." |
| date: 2026-02-13 |
| authors: |
| - anson-qian |
Copilot
AI
Feb 10, 2026
This post is future-dated (2026-02-13). Docusaurus publishes future-dated posts immediately, so merging this PR will publish it right away. If this isn’t intended to go live yet, add draft: true (or unlisted: true) to the front matter before merging.
| - anson-qian |
| - bob-mital |
| - kenneth-kilty |
| categories: |
Copilot
AI
Feb 10, 2026
categories: is present but empty in the front matter. Other blog posts in this repo don’t use categories in front matter, so this is likely accidental and should be removed to avoid confusing/unused metadata.
Suggested change (remove this line):
| categories: |
| - **Operational simplicity** through automated credential management with service principal |
| Whether you're [fine-tuning models with DeepSpeed or LLaMA-Factory](https://github.com/Azure-Samples/aks-anyscale/tree/main/examples/finetuning) or [deploying inference endpoints for LLMs ranging from small to large-scale reasoning models](https://github.com/Azure-Samples/aks-anyscale/tree/main/examples/inferencing), this architecture delivers a production-grade ML platform that scales with your needs. |
Copilot
AI
Feb 10, 2026
This post is missing the <!-- truncate --> marker that most blog posts use to control the listing-page excerpt. Add it after the intro so the blog index doesn’t render the entire article preview.
Suggested change (add this line):
| <!-- truncate --> |
| - **Improve fault tolerance**: If one region experiences an outage or capacity shortage, workloads can be automatically rerouted to healthy clusters |
| - **Scale beyond single-cluster limits**: Azure imposes quota limits on GPU instances per region, but multi-region deployments let you aggregate capacity |
| To add a cluster or another region to your existing Anyscale cloud, define a cloud resource ([cloud_resource.yaml](./aks-anyscale/cloud_resource.yaml)): |
Copilot
AI
Feb 10, 2026
The link ./aks-anyscale/cloud_resource.yaml points to a file that isn’t present in this blog post directory, so it will be broken on the site. Update the link to the correct location (for example, the Azure-Samples repo path) or add the referenced file to the post assets.
Suggested change (original line, then replacement):
| To add a cluster or another region to your existing Anyscale cloud, define a cloud resource ([cloud_resource.yaml](./aks-anyscale/cloud_resource.yaml)): |
| To add a cluster or another region to your existing Anyscale cloud, define a cloud resource ([cloud_resource.yaml](https://github.com/Azure-Samples/aks-anyscale/blob/main/config/cloud_resource.yaml)): |
| -f "$CLOUD_RESOURCE_YAML" |
| ``` |
| With infrastructure deployed across multiple regions, you can manage and monitor Ray workloads from the Anyscale console. The single-pane-of-glass view shows all registered clusters and their available resources: |
Copilot
AI
Feb 10, 2026
“single-pane-of-glass” is typically written without hyphens (“single pane of glass”). Consider updating to match common Microsoft style guidance.
Suggested change (original line, then replacement):
| With infrastructure deployed across multiple regions, you can manage and monitor Ray workloads from the Anyscale console. The single-pane-of-glass view shows all registered clusters and their available resources: |
| With infrastructure deployed across multiple regions, you can manage and monitor Ray workloads from the Anyscale console. The single pane of glass view shows all registered clusters and their available resources: |
| --enable-blob-driver |
| ... |
Copilot
AI
Feb 10, 2026
This az aks create example is missing a line-continuation (\) after --enable-blob-driver, so the command won’t paste/run as written. Add the continuation (and consider replacing ... with a commented placeholder) to keep the snippet copy/pasteable.
Suggested change (original two lines, then replacement):
| --enable-blob-driver |
| ... |
| --enable-blob-driver \ |
| # ...additional flags as needed... |
| storage: 100Gi |
| ``` |
| 5. Configure Ray workloads read from and write to mounted blob path (`/mnt/cluster_storage`). |
Copilot
AI
Feb 10, 2026
Step 5 reads awkwardly (“Configure Ray workloads read from and write…”). Consider rephrasing to “Configure Ray workloads to read from and write to …” for correct grammar.
Suggested change (original line, then replacement):
| 5. Configure Ray workloads read from and write to mounted blob path (`/mnt/cluster_storage`). |
| 5. Configure Ray workloads to read from and write to the mounted blob path (`/mnt/cluster_storage`). |
| @@ -0,0 +1,166 @@ |
| --- |
| title: "Scaling Ray on AKS" |
| description: "Learn how to run production-grade Ray workloads on Azure Kubernetes Service with multi-region support, unified storage, and secure authentication." |
Copilot
AI
Feb 10, 2026
The front-matter description is ~145 characters, which is shorter than the repo’s 150–160 character SEO guidance. Please expand it slightly so it lands in the 150–160 range.
Suggested change (original line, then replacement):
| description: "Learn how to run production-grade Ray workloads on Azure Kubernetes Service with multi-region support, unified storage, and secure authentication." |
| description: "Learn how to run production-grade Ray workloads on Azure Kubernetes Service with multi-region support, unified storage, and secure authentication for AI." |
| ```bash |
| anyscale cloud resource create \ |
| --cloud "$ANYSCALE_CLOUD_NAME" \ |
| -f "$CLOUD_RESOURCE_YAML" |
Copilot
AI
Feb 10, 2026
The CLI example uses -f "$CLOUD_RESOURCE_YAML", but the post never defines CLOUD_RESOURCE_YAML (and earlier names the file cloud_resource.yaml). Consider using the explicit path shown in the text or add a short snippet showing how to set CLOUD_RESOURCE_YAML.
Suggested change (original line, then replacement):
| -f "$CLOUD_RESOURCE_YAML" |
| -f ./aks-anyscale/cloud_resource.yaml |
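The other option the comment mentions, defining `CLOUD_RESOURCE_YAML` before the command, could look roughly like the sketch below. The default path is an assumption that reuses the `./aks-anyscale/cloud_resource.yaml` path from the post's link, which another comment in this review flags as missing from the blog directory. Adjust it to wherever the file actually lives.

```shell
# Assumed default location of the cloud resource definition (hypothetical;
# reuses the path from the post's link). An already-exported value wins.
CLOUD_RESOURCE_YAML="${CLOUD_RESOURCE_YAML:-./aks-anyscale/cloud_resource.yaml}"

# Warn early if the file is missing, before passing it to
# `anyscale cloud resource create -f "$CLOUD_RESOURCE_YAML"`.
if [ ! -f "$CLOUD_RESOURCE_YAML" ]; then
  echo "warning: $CLOUD_RESOURCE_YAML not found" >&2
fi
```

With the variable set this way, the `anyscale cloud resource create` command in the post can stay unchanged, and readers who keep the file elsewhere only need to export a different value.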
No description provided.