Use Amazon SageMaker HyperPod and Anyscale for next-generation distributed computing

By Advanced AI Editor | October 9, 2025 | 10 Mins Read


This post was written with Dominic Catalano from Anyscale.

Organizations building and deploying large-scale AI models often face critical infrastructure challenges that can directly impact their bottom line: unstable training clusters that fail mid-job, inefficient resource utilization driving up costs, and complex distributed computing frameworks requiring specialized expertise. These factors can lead to unused GPU hours, delayed projects, and frustrated data science teams. This post demonstrates how SageMaker HyperPod and the Anyscale platform address these challenges by providing resilient, efficient infrastructure for distributed AI workloads.

Amazon SageMaker HyperPod is a purpose-built persistent generative AI infrastructure optimized for machine learning (ML) workloads. It provides robust infrastructure for large-scale ML workloads with high-performance hardware, so organizations can build heterogeneous clusters using tens to thousands of GPU accelerators. With nodes optimally co-located on a single spine, SageMaker HyperPod reduces networking overhead for distributed training. It maintains operational stability through continuous monitoring of node health, automatically swapping faulty nodes with healthy ones and resuming training from the most recently saved checkpoint, all of which can help save up to 40% of training time. For advanced ML users, SageMaker HyperPod allows SSH access to the nodes in the cluster, enabling deep infrastructure control, and allows access to SageMaker tooling, including Amazon SageMaker Studio, MLflow, and SageMaker distributed training libraries, along with support for various open-source training libraries and frameworks. SageMaker Flexible Training Plans complement this by enabling GPU capacity reservation up to 8 weeks in advance for durations up to 6 months.

The Anyscale platform integrates seamlessly with SageMaker HyperPod when using Amazon Elastic Kubernetes Service (Amazon EKS) as the cluster orchestrator. Ray is the leading AI compute engine, offering Python-based distributed computing capabilities to address AI workloads ranging from multimodal AI, data processing, model training, and model serving. Anyscale unlocks the power of Ray with comprehensive tooling for developer agility, critical fault tolerance, and an optimized version called RayTurbo, designed to deliver leading cost-efficiency. Through a unified control plane, organizations benefit from simplified management of complex distributed AI use cases with fine-grained control across hardware.

The combined solution provides extensive monitoring through SageMaker HyperPod real-time dashboards tracking node health, GPU utilization, and network traffic. Integration with Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana delivers deep visibility into cluster performance, complemented by Anyscale’s monitoring framework, which provides built-in metrics for monitoring Ray clusters and the workloads that run on them.

This post demonstrates how to integrate the Anyscale platform with SageMaker HyperPod. This combination can deliver tangible business outcomes: reduced time-to-market for AI initiatives, lower total cost of ownership through optimized resource utilization, and increased data science productivity by minimizing infrastructure management overhead. It is ideal for Amazon EKS and Kubernetes-focused organizations, teams with large-scale distributed training needs, and those invested in the Ray ecosystem or SageMaker.

Solution overview

The following architecture diagram illustrates SageMaker HyperPod with Amazon EKS orchestration and Anyscale.

(Architecture diagram: end-to-end AWS and Anyscale flow depicting job submission, EKS pod orchestration, data access, and monitoring.)

The sequence of events in this architecture is as follows:

1. A user submits a job to the Anyscale Control Plane, which is the main user-facing endpoint.
2. The Anyscale Control Plane communicates the job to the Anyscale Operator running in the SageMaker HyperPod cluster, inside the SageMaker HyperPod virtual private cloud (VPC).
3. Upon receiving the job, the Anyscale Operator initiates creation of the necessary pods by calling the EKS control plane.
4. The EKS control plane orchestrates creation of a Ray head pod and worker pods. Together, these pods form a Ray cluster running on SageMaker HyperPod with Amazon EKS.
5. The Anyscale Operator submits the job through the head pod, which serves as the primary coordinator for the distributed workload.
6. The head pod distributes the workload across multiple worker pods, following the hierarchical structure of the SageMaker HyperPod EKS cluster.
7. Worker pods execute their assigned tasks, potentially accessing required data from storage services such as Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), or Amazon FSx for Lustre in the user VPC.
8. Throughout job execution, metrics and logs are published to Amazon CloudWatch and to Amazon Managed Service for Prometheus or Amazon Managed Grafana for observability.
9. When the Ray job is complete, the job artifacts (final model weights, inference results, and so on) are saved to the designated storage service.
10. Job results (status, metrics, logs) are sent through the Anyscale Operator back to the Anyscale Control Plane.

This flow shows distribution and execution of user-submitted jobs across the available computing resources, while maintaining monitoring and data accessibility throughout the process.

Prerequisites

Before you begin, you must have the following resources:

Set up Anyscale Operator

Complete the following steps to set up the Anyscale Operator:

In your workspace, download the aws-do-ray repository:

git clone https://github.com/aws-samples/aws-do-ray.git
cd aws-do-ray/Container-Root/ray/anyscale

This repository contains the commands needed to deploy the Anyscale Operator on a SageMaker HyperPod cluster. The aws-do-ray project aims to simplify the deployment and scaling of distributed Python applications using Ray on Amazon EKS or SageMaker HyperPod. The aws-do-ray container shell is equipped with intuitive action scripts and comes preconfigured with convenient shortcuts, which save extensive typing and increase productivity. You can optionally use these features by building and opening a bash shell in the container following the instructions in the aws-do-ray README, or you can continue with the following steps.
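If you take the container route, aws-do projects typically follow the do-framework build/run/exec convention. The following is a hedged sketch; the script names are assumed from that convention, so verify them against the aws-do-ray README:

# Sketch of the do-framework flow (script names assumed, not confirmed)
./build.sh   # build the aws-do-ray container image
./run.sh     # start the container
./exec.sh    # open a bash shell inside the running container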

If you continue with these steps, make sure your environment is properly set up:

Verify your connection to the HyperPod cluster:

Obtain the name of the EKS cluster on the SageMaker HyperPod console. In your cluster details, you will see your EKS cluster orchestrator.
(Screenshot: active ml-cluster-eks details showing configuration, orchestrator settings, and management options.)
Update kubeconfig to connect to the EKS cluster:

aws eks update-kubeconfig --region <region> --name my-eks-cluster

Then check the nodes and their health status:

kubectl get nodes -L node.kubernetes.io/instance-type -L sagemaker.amazonaws.com/node-health-status -L sagemaker.amazonaws.com/deep-health-check-status

The following screenshot shows an example output.

(Screenshot: terminal view of the Kubernetes node health check showing two ml.g5 instances with status and health details.)

If the output indicates InProgress instead of Passed, wait for the deep health checks to finish.

(Screenshot: terminal view of the Kubernetes node health check showing two ml.g5 instances with differing scheduling statuses.)

Review the env_vars file and update the variable AWS_EKS_HYPERPOD_CLUSTER; you can leave the other values at their defaults or adjust them as needed. A rough sketch of the file's shape follows.
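As an illustration only (AWS_EKS_HYPERPOD_CLUSTER and ANYSCALE_CLOUD_NAME appear elsewhere in this walkthrough; the other names and all values here are placeholders, so treat the file itself as authoritative):

export AWS_EKS_HYPERPOD_CLUSTER=my-eks-cluster   # EKS orchestrator cluster from the HyperPod console
export AWS_REGION=us-west-2                      # placeholder: Region hosting the cluster
export ANYSCALE_CLOUD_NAME=my-anyscale-cloud     # used later when registering the Anyscale Cloud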
Deploy your requirements:

Execute:
./1.deploy-requirements.sh

This creates the anyscale namespace, installs Anyscale dependencies, configures login to your Anyscale account (this step prompts you for additional verification, as shown in the following screenshot), adds the Anyscale Helm chart, installs the ingress-nginx controller, and finally labels and taints SageMaker HyperPod nodes for the Anyscale worker pods.

(Screenshot: terminal showing Python environment setup with a package installation log and Anyscale login instructions.)
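You can sanity-check the deployment with generic kubectl and Helm commands; these are standard tooling calls, not part of the repository's scripts:

kubectl get ns anyscale                 # namespace created by the script
helm list -n anyscale                   # Anyscale chart release
kubectl get pods -n ingress-nginx       # ingress-nginx controller pods
kubectl describe nodes | grep -i taint  # taints applied for Anyscale workers (exact keys depend on the script)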

Create an EFS file system:

Execute:

./2.create-efs.sh

Amazon EFS serves as the shared cluster storage for the Anyscale pods.
At the time of writing, Amazon EFS and S3FS are the supported file system options when using Anyscale and SageMaker HyperPod setups with Ray on AWS. Although FSx for Lustre is not supported with this setup, you can use it with KubeRay on SageMaker HyperPod EKS.
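For orientation, the script automates AWS CLI calls along these lines; the IDs below are placeholders, and the script itself remains the source of truth:

aws efs create-file-system --performance-mode generalPurpose --tags Key=Name,Value=anyscale-efs
aws efs create-mount-target --file-system-id fs-0123456789abcdef0 \
    --subnet-id subnet-0123456789abcdef0 --security-groups sg-0123456789abcdef0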

Register an Anyscale Cloud:

Execute:

./3.register-cloud.sh

This registers a self-hosted Anyscale Cloud into your SageMaker HyperPod cluster. By default, it uses the value of ANYSCALE_CLOUD_NAME in the env_vars file. You can modify this field as needed. At this point, you will be able to see your registered cloud on the Anyscale console.

Deploy the Kubernetes Anyscale Operator:

Execute:

./4.deploy-anyscale.sh

This command installs the Anyscale Operator in the anyscale namespace. The Operator will start posting health checks to the Anyscale Control Plane.

To see the Anyscale Operator pod, run the following command:

kubectl get pods -n anyscale

Submit training job

This section walks through a simple training job submission. The example implements distributed training of a neural network for Fashion MNIST classification using the Ray Train framework on SageMaker HyperPod with Amazon EKS orchestration, demonstrating how to use AWS managed ML infrastructure combined with Ray's distributed computing capabilities for scalable model training. A minimal sketch of what such a Ray Train script looks like follows; then complete the steps after it.
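The following is a self-contained sketch in the spirit of the dt-pytorch example, not the repository's actual code; the model architecture, hyperparameters, and paths are illustrative placeholders:

# Minimal Ray Train sketch (illustrative; not the dt-pytorch source)
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_data_loader, prepare_model


def train_loop_per_worker(config):
    # Each Ray worker runs this function; Ray Train sets up the process group.
    dataset = datasets.FashionMNIST(
        root="/tmp/data", train=True, download=True,
        transform=transforms.ToTensor(),
    )
    loader = prepare_data_loader(  # adds DistributedSampler and device placement
        DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)
    )
    model = prepare_model(  # wraps the model in DistributedDataParallel
        nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128),
                      nn.ReLU(), nn.Linear(128, 10))
    )
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

    for epoch in range(config["epochs"]):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
        # Surface per-epoch metrics to Ray Train and the Anyscale console.
        ray.train.report({"epoch": epoch, "loss": loss.item()})


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"batch_size": 64, "lr": 1e-3, "epochs": 2},
    # Two GPU workers as an example; size this to your HyperPod nodes.
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
print(result.metrics)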

Navigate to the jobs directory. This contains folders for available example jobs you can run. For this walkthrough, go to the dt-pytorch directory containing the training job.

Configure the required environment variables (a sample export sequence follows the list):

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_REGION
ANYSCALE_CLOUD_NAME
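For example (the Region and cloud name are placeholders; keep real credentials out of version control):

export AWS_ACCESS_KEY_ID=...                   # credentials the job uses to reach AWS services
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION=us-west-2                    # placeholder Region
export ANYSCALE_CLOUD_NAME=my-anyscale-cloud   # must match the cloud registered earlier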

Create Anyscale compute configuration:
./1.create-compute-config.sh
Submit the training job:
./2.submit-dt-pytorch.sh

This uses the job configuration specified in job_config.yaml. For more information on the job config, refer to JobConfig. A hedged sketch of such a file follows.
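An Anyscale job config is a small YAML file along these lines; the fields shown are typical of Anyscale job configs in general, and the values are placeholders rather than the repository's actual job_config.yaml:

name: dt-pytorch-fashion-mnist     # display name on the Anyscale console
entrypoint: python train.py        # hypothetical entrypoint script
working_dir: .                     # directory uploaded alongside the job
compute_config: my-compute-config  # compute configuration created in the previous step
max_retries: 1                     # restart attempts on failure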
Monitor the deployment. You will see the newly created head and worker pods in the anyscale namespace.
kubectl get pods -n anyscale
View the job status and logs on the Anyscale console to monitor your submitted job’s progress and output.
(Screenshot: Ray distributed training output displaying worker and driver logs, checkpoints, metrics, and configuration details.)

Clean up

To clean up your Anyscale cloud, run the following command:

cd ../..
./5.remove-anyscale.sh

To delete your SageMaker HyperPod cluster and associated resources, delete the CloudFormation stack if this is how you created the cluster and its resources.
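For example, assuming the stack name used when the cluster was created (placeholder shown):

aws cloudformation delete-stack --stack-name <hyperpod-eks-stack>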

Conclusion

This post demonstrated how to set up and deploy the Anyscale Operator on SageMaker HyperPod using Amazon EKS for orchestration. SageMaker HyperPod and Anyscale RayTurbo provide a highly efficient, resilient solution for large-scale distributed AI workloads: SageMaker HyperPod delivers robust, automated infrastructure management and fault recovery for GPU clusters, and RayTurbo accelerates distributed computing and optimizes resource usage with no code changes required. By combining the high-throughput, fault-tolerant environment of SageMaker HyperPod with RayTurbo's faster data processing and smarter scheduling, organizations can train and serve models at scale with improved reliability and significant cost savings, making this stack ideal for demanding tasks like large language model pre-training and batch inference.

For more examples of using SageMaker HyperPod, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and the Amazon SageMaker HyperPod Developer Guide. For information on how customers are using RayTurbo, refer to RayTurbo.


About the authors

Sindhura Palakodety is a Senior Solutions Architect at AWS and Single-Threaded Leader (STL) for ISV Generative AI, where she is dedicated to empowering customers in developing enterprise-scale, Well-Architected solutions. She specializes in generative AI and data analytics domains, helping organizations use innovative technologies for transformative business outcomes.

Mark Vinciguerra is an Associate Specialist Solutions Architect at AWS based in New York. He focuses on generative AI training and inference, with the goal of helping customers architect, optimize, and scale their workloads across various AWS services. Prior to AWS, he went to Boston University and graduated with a degree in Computer Engineering.

Florian Gauter is a Worldwide Specialist Solutions Architect at AWS, based in Hamburg, Germany. He specializes in AI/ML and generative AI solutions, helping customers optimize and scale their AI/ML workloads on AWS. With a background as a Data Scientist, Florian brings deep technical expertise to help organizations design and implement sophisticated ML solutions. He works closely with customers worldwide to transform their AI initiatives and maximize the value of their ML investments on AWS.

Alex Iankoulski is a Principal Solutions Architect in the Worldwide Specialist Organization at AWS. He focuses on orchestration of AI/ML workloads using containers. Alex is the author of the do-framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. Over the past 10 years, Alex has worked on helping customers do more on AWS, democratizing AI and ML, combating climate change, and making travel safer, healthcare better, and energy smarter.

Anoop Saha is a Senior GTM Specialist at AWS focusing on generative AI model training and inference. He is partnering with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop has held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.

Dominic Catalano is a Group Product Manager at Anyscale, where he leads product development across AI/ML infrastructure, developer productivity, and enterprise security. His work focuses on distributed systems, Kubernetes, and helping teams run AI workloads at scale.


