Training and deploying large AI models requires advanced distributed computing capabilities, but managing these distributed systems shouldn’t be complex for data scientists and machine learning (ML) practitioners. The newly released command line interface (CLI) and software development kit (SDK) for Amazon SageMaker HyperPod simplify how you can use the service’s distributed training and inference capabilities.
The SageMaker HyperPod CLI provides data scientists with an intuitive command-line experience, abstracting away the underlying complexity of distributed systems. Built on top of the SageMaker HyperPod SDK, the CLI offers straightforward commands for common workflows like launching training or fine-tuning jobs, deploying inference endpoints, and monitoring cluster performance. This makes it ideal for quick experimentation and iteration.
For more advanced use cases requiring fine-grained control, the SageMaker HyperPod SDK enables programmatic access to customize your ML workflows. Developers can use the SDK’s Python interface to precisely configure training and deployment parameters while maintaining the simplicity of working with familiar Python objects.
In this post, we demonstrate how to use both the CLI and SDK to train and deploy large language models (LLMs) on SageMaker HyperPod. We walk through practical examples of distributed training using Fully Sharded Data Parallel (FSDP) and model deployment for inference, showcasing how these tools streamline the development of production-ready generative AI applications.
Prerequisites
To follow the examples in this post, you must have the following prerequisites:
Because the use cases that we demonstrate are about training and deploying LLMs with the SageMaker HyperPod CLI and SDK, you must also install the following Kubernetes operators in the cluster:
Install the SageMaker HyperPod CLI
First, you must install the latest version of the SageMaker HyperPod CLI and SDK (the examples in this post are based on version 3.1.0). From the local environment, run the following command (you can also install in a Python virtual environment):
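# Install (or upgrade to) the latest SageMaker HyperPod CLI and SDK
pip install --upgrade sagemaker-hyperpod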
This command sets up the tools needed to interact with SageMaker HyperPod clusters. For an existing installation, make sure you have the latest version of the package (sagemaker-hyperpod>=3.1.0) installed to be able to use the relevant set of features. To verify that the CLI is installed correctly, run the hyp command and check the output:
The output will be similar to the following, and includes instructions on how to use the CLI:
For more information on CLI usage and the available commands and respective parameters, refer to the CLI reference documentation.
Set the cluster context
The SageMaker HyperPod CLI and SDK use the Kubernetes API to interact with the cluster. Therefore, make sure the underlying Kubernetes Python client is configured to execute API calls against your cluster by setting the cluster context.
Use the CLI to list the clusters available in your AWS account:
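The following assumes the list-cluster command from the CLI reference:

# List SageMaker HyperPod clusters in the current AWS account and Region
hyp list-cluster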
Set the cluster context, specifying the cluster name as input (in our case, we use ml-cluster as the cluster name):
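The following assumes the set-cluster-context command from the CLI reference:

# Point the Kubernetes client at the ml-cluster SageMaker HyperPod cluster
hyp set-cluster-context --cluster-name ml-cluster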
Train models with the SageMaker HyperPod CLI and SDK
The SageMaker HyperPod CLI provides a straightforward way to submit PyTorch model training and fine-tuning jobs to a SageMaker HyperPod cluster. In the following example, we schedule a Meta Llama 3.1 8B model training job with FSDP.
The CLI executes training using the HyperPodPyTorchJob Kubernetes custom resource, which is implemented by the HyperPod training operator that needs to be installed in the cluster, as discussed in the prerequisites section.
First, clone the awsome-distributed-training repository and create the Docker image that you will use for the training job:
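For example (the example directory path is an assumption and may change as the repository evolves):

# Clone the repository and switch to the FSDP test case
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/3.test_cases/pytorch/FSDP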
Then, log in to the Amazon Elastic Container Registry (Amazon ECR) to pull the base image and build the new container:
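The account ID and Region below are assumptions; match them to the FROM line of the Dockerfile you are building from:

# Authenticate to the registry that hosts the base image
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

# Build the training image
docker build -t fsdp:latest .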
The Dockerfile in the awsome-distributed-training repository referenced in the preceding code already contains the HyperPod elastic agent, which orchestrates the lifecycle of the training workers in each container and communicates with the HyperPod training operator. If you’re using a different Dockerfile, install the HyperPod elastic agent following the instructions in HyperPod elastic agent.
Next, create a new registry for your training image if needed and push the built image to it:
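For example (the repository name fsdp is illustrative; adjust the Region as needed):

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_REGION=us-east-1

# Create an ECR repository for the training image (skip if it already exists)
aws ecr create-repository --repository-name fsdp

# Authenticate to your own registry, then tag and push the image
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
docker tag fsdp:latest ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/fsdp:latest
docker push ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/fsdp:latest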
After you have successfully created the Docker image, you can submit the training job using the SageMaker HyperPod CLI.
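The following is a sketch of the submission command. The image URI, instance type, node count, and training arguments are placeholders to adapt to your environment; the exact flag syntax can be verified with the --help option shown below, and the full set of FSDP arguments is available in the Kubernetes manifests of the awsome-distributed-training repository.

hyp create hyp-pytorch-job \
    --job-name fsdp-llama3-1-8b \
    --image ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/fsdp:latest \
    --command '[hyperpodrun, --nnodes=4, --nproc-per-node=8, /fsdp/train.py]' \
    --args '[--model_type=llama_v3, --checkpoint_dir=/fsx/checkpoints]' \
    --environment '{"LOGLEVEL": "INFO"}' \
    --instance-type ml.g5.8xlarge \
    --node-count 4 \
    --tasks-per-node 1 \
    --max-retry 3 \
    --volume name=fsx,type=pvc,mount_path=/fsx,claim_name=fsx-claim,read_only=false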
Internally, the SageMaker HyperPod CLI uses the Kubernetes Python client to build a HyperPodPyTorchJob custom resource and then creates it on the Kubernetes cluster.
You can modify the CLI command for other Meta Llama configurations by changing --args to the desired arguments and values; examples can be found in the Kubernetes manifests in the awsome-distributed-training repository.
In the given configuration, the training job will write checkpoints to /fsx/checkpoints on the FSx for Lustre PVC.
The hyp create hyp-pytorch-job command supports additional arguments, which can be discovered by running the following:
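# Show all supported arguments and their descriptions
hyp create hyp-pytorch-job --help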
The preceding example code contains the following relevant arguments:
--command and --args offer flexibility in setting the command to be executed in the container. The command executed is hyperpodrun, implemented by the HyperPod elastic agent that is installed in the training container. The HyperPod elastic agent extends PyTorch’s ElasticAgent and manages the communication of the various workers with the HyperPod training operator. For more information, refer to HyperPod elastic agent.
--environment defines environment variables and customizes the training execution.
--max-retry indicates the maximum number of restarts at the process level that will be attempted by the HyperPod training operator. For more information, refer to Using the training operator to run jobs.
--volume is used to map persistent or ephemeral volumes to the container.
If successful, the command will output the following:
You can observe the status of the training job through the CLI. Running hyp list hyp-pytorch-job will show the status first as Created and then as Running after the containers have been started:
To list the pods that are created by this training job, run the following command:
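The following assumes the list-pods command from the CLI reference:

hyp list-pods hyp-pytorch-job --job-name fsdp-llama3-1-8b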
You can observe the logs of one of the training pods that get spawned by running the following command:
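The following assumes the get-logs command from the CLI reference; replace the pod name with one returned by the previous command:

hyp get-logs hyp-pytorch-job --pod-name <pod-name> --job-name fsdp-llama3-1-8b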
We elaborate on more advanced debugging and observability features at the end of this section.
Alternatively, if you prefer a programmatic experience and more advanced customization options, you can submit the training job using the SageMaker HyperPod Python SDK. For more information, refer to the SDK reference documentation. The following code will yield the equivalent training job submission to the preceding CLI example:
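The following is a minimal sketch; the module paths and class names follow the SDK reference documentation and may differ between SDK versions, and the image URI, volumes (omitted here), and training arguments are placeholders rather than the full FSDP configuration.

from sagemaker.hyperpod.common.config import Metadata
from sagemaker.hyperpod.training import (
    Containers,
    HyperPodPytorchJob,
    ReplicaSpec,
    Resources,
    RunPolicy,
    Spec,
    Template,
)

# Container definition mirroring the CLI example (image URI and training arguments are placeholders)
container = Containers(
    name="fsdp-training",
    image="<account-id>.dkr.ecr.<region>.amazonaws.com/fsdp:latest",
    image_pull_policy="IfNotPresent",
    command=["hyperpodrun", "--nnodes=4", "--nproc-per-node=8", "/fsdp/train.py"],
    args=["--model_type=llama_v3", "--checkpoint_dir=/fsx/checkpoints"],
    resources=Resources(
        requests={"nvidia.com/gpu": "8"},
        limits={"nvidia.com/gpu": "8"},
    ),
)

# One replica per node; the HyperPod training operator schedules the pods across the instances
replica_specs = [
    ReplicaSpec(
        name="pod",
        replicas=4,
        template=Template(spec=Spec(containers=[container])),
    )
]

pytorch_job = HyperPodPytorchJob(
    metadata=Metadata(name="fsdp-llama3-1-8b", namespace="default"),
    nproc_per_node="8",
    replica_specs=replica_specs,
    run_policy=RunPolicy(clean_pod_policy="None"),
)

# Submit the job to the cluster selected with set-cluster-context
pytorch_job.create()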
Debugging training jobs
In addition to monitoring the training pod logs as described earlier, there are several other useful ways of debugging training jobs:
You can submit training jobs with an additional --debug True flag, which will print the Kubernetes YAML to the console when the job starts so it can be inspected by users.
You can view a list of current training jobs by running hyp list hyp-pytorch-job.
You can view the status and corresponding events of the job by running hyp describe hyp-pytorch-job --job-name fsdp-llama3-1-8b.
If the HyperPod observability stack is deployed to the cluster, run hyp get-monitoring --grafana and hyp get-monitoring --prometheus to get the Grafana dashboard and Prometheus workspace URLs, respectively, to view cluster and job metrics.
To monitor GPU utilization or view directory contents, it can be useful to execute commands or open an interactive shell in the pods. For example, you can run a command in a pod with kubectl exec -it <pod-name> -- nvtop to use nvtop for visibility into GPU utilization, or open an interactive shell with kubectl exec -it <pod-name> -- /bin/bash.
The logs of the HyperPod training operator controller pod can contain valuable information about scheduling. To view them, run kubectl get pods -n aws-hyperpod | grep hp-training-controller-manager to find the controller pod name, and then run kubectl logs <controller-pod-name> -n aws-hyperpod to view the corresponding logs.
Deploy models with the SageMaker HyperPod CLI and SDK
The SageMaker HyperPod CLI provides commands to quickly deploy models to your SageMaker HyperPod cluster for inference. You can deploy both foundation models (FMs) available on Amazon SageMaker JumpStart and custom models whose artifacts are stored on Amazon S3 or FSx for Lustre file systems.
This functionality automatically deploys the chosen model to the SageMaker HyperPod cluster through Kubernetes custom resources, which are implemented by the HyperPod inference operator that needs to be installed in the cluster, as discussed in the prerequisites section. Optionally, it can also create a SageMaker inference endpoint and an Application Load Balancer (ALB), which you can invoke directly over HTTPS using a generated TLS certificate.
Deploy SageMaker JumpStart models
You can deploy an FM that is available on SageMaker JumpStart with the following command:
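A sketch of the deployment command follows; the endpoint name and S3 URI are placeholders, and the model ID shown is an assumption that you should verify against the SageMaker JumpStart model hub:

hyp create hyp-jumpstart-endpoint \
    --model-id deepseek-llm-r1-distill-qwen-1-5b \
    --instance-type ml.g5.8xlarge \
    --endpoint-name deepseek-qwen-1-5b-endpoint \
    --tls-certificate-output-s3-uri s3://<your-tls-bucket>/certificates/ \
    --namespace default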
The preceding code includes the following parameters:
--model-id is the model ID in the SageMaker JumpStart model hub. In this example, we deploy a DeepSeek R1-distilled version of Qwen 1.5B, which is available on SageMaker JumpStart.
--instance-type is the target instance type in your SageMaker HyperPod cluster where you want to deploy the model. This instance type must be supported by the chosen model.
--endpoint-name is the name that the SageMaker inference endpoint will have. This name must be unique. SageMaker inference endpoint creation is optional.
--tls-certificate-output-s3-uri is the S3 bucket location where the TLS certificate for the ALB will be stored. This can be used to directly invoke the model through HTTPS. You can use S3 buckets that are accessible by the HyperPod inference operator IAM role.
--namespace is the Kubernetes namespace the model will be deployed to. The default value is set to default.
The CLI supports more advanced deployment configurations, including auto scaling, through additional parameters, which can be viewed by running the following command:
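# Show all supported arguments, including auto scaling configuration
hyp create hyp-jumpstart-endpoint --help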
If successful, the command will output the following:
After a few minutes, both the ALB and the SageMaker inference endpoint will be available, which can be observed through the CLI. Running hyp list hyp-jumpstart-endpoint will show the status first as DeploymentInProgress and then as DeploymentComplete when the endpoint is ready to be used:
To get additional visibility into the deployment pod, run the following commands to find the pod name and view the corresponding logs:
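One way to do this is with kubectl directly, assuming the deployment runs in the default namespace:

# Find the name of the deployment pod, then view its logs
kubectl get pods -n default
kubectl logs <deployment-pod-name> -n default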
The output will look similar to the following:
You can invoke the SageMaker inference endpoint you created through the CLI by running the following command:
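For example (the endpoint name is the one chosen earlier, and the request body format depends on the serving container):

hyp invoke hyp-jumpstart-endpoint \
    --endpoint-name deepseek-qwen-1-5b-endpoint \
    --body '{"inputs": "What is the capital of France?"}'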
You will get an output similar to the following:
Alternatively, if you prefer a programmatic experience and advanced customization options, you can use the SageMaker HyperPod Python SDK. The following code will yield the equivalent deployment to the preceding CLI example:
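The following is a minimal sketch; the module paths and class names are based on the SDK reference documentation and may differ between SDK versions, and the endpoint name and S3 URI are placeholders:

from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import (
    Model,
    SageMakerEndpoint,
    Server,
    TlsConfig,
)
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

# Model and server configuration mirroring the CLI example
model = Model(model_id="deepseek-llm-r1-distill-qwen-1-5b")
server = Server(instance_type="ml.g5.8xlarge")
sagemaker_endpoint = SageMakerEndpoint(name="deepseek-qwen-1-5b-endpoint")
tls_config = TlsConfig(tls_certificate_output_s3_uri="s3://<your-tls-bucket>/certificates/")

js_endpoint = HPJumpStartEndpoint(
    model=model,
    server=server,
    sage_maker_endpoint=sagemaker_endpoint,
    tls_config=tls_config,
    namespace="default",
)

# Deploy to the cluster; progress can be observed with hyp list hyp-jumpstart-endpoint
js_endpoint.create()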
Deploy custom models
You can also use the CLI to deploy custom models with model artifacts stored on either Amazon S3 or FSx for Lustre. This is useful for models that have been fine-tuned on custom data. You must provide the storage location of the model artifacts as well as a container image for inference that is compatible with the model artifacts and SageMaker inference endpoints. In the following example, we deploy a TinyLlama 1.1B model from Amazon S3 using the DJL Large Model Inference container image.
In preparation, download the model artifacts locally and push them to an S3 bucket:
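For example, assuming the TinyLlama/TinyLlama-1.1B-Chat-v1.0 repository on Hugging Face and illustrative local and S3 paths:

# Download the model artifacts locally
pip install -U "huggingface_hub[cli]"
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir ./tinyllama-1.1b

# Upload the artifacts to an S3 bucket that the HyperPod inference operator role can access
aws s3 sync ./tinyllama-1.1b s3://<your-model-bucket>/models/tinyllama-1.1b/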
Now you can deploy the model with the following command:
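A sketch of the deployment command follows; the bucket, Region, and endpoint names are placeholders, the image URI must point to a DJL LMI container compatible with the model, and the flags not described below (--image-uri, --container-port, --model-volume-mount-name) are assumptions to verify with the --help option:

hyp create hyp-custom-endpoint \
    --model-name tinyllama-1-1b \
    --model-source-type s3 \
    --model-location models/tinyllama-1.1b/ \
    --s3-bucket-name <your-model-bucket> \
    --s3-region <your-region> \
    --instance-type ml.g5.8xlarge \
    --endpoint-name tinyllama-1-1b-endpoint \
    --image-uri <djl-lmi-container-image-uri> \
    --container-port 8080 \
    --model-volume-mount-name model-weights \
    --tls-certificate-output-s3-uri s3://<your-tls-bucket>/certificates/ \
    --namespace default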
The preceding code contains the following key parameters:
--model-name is the name of the model that will be created in SageMaker
--model-source-type specifies either fsx or s3 for the location of the model artifacts
--model-location specifies the prefix or folder where the model artifacts are located
--s3-bucket-name and --s3-region specify the S3 bucket name and AWS Region, respectively
--instance-type, --endpoint-name, --namespace, and --tls-certificate-output-s3-uri behave the same as for the deployment of SageMaker JumpStart models
Similar to SageMaker JumpStart model deployment, the CLI supports more advanced deployment configurations, including auto scaling, through additional parameters, which you can view by running the following command:
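# Show all supported arguments, including auto scaling configuration
hyp create hyp-custom-endpoint --help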
If successful, the command will output the following:
After a few minutes, both the ALB and the SageMaker inference endpoint will be available, which you can observe through the CLI. Running hyp list hyp-custom-endpoint will show the status first as DeploymentInProgress and then as DeploymentComplete when the endpoint is ready to be used:
To get additional visibility into the deployment pod, run the following commands to find the pod name and view the corresponding logs:
The output will look similar to the following:
You can invoke the SageMaker inference endpoint you created through the CLI by running the following command:
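For example (the endpoint name is the one chosen earlier, and the request body format depends on the serving container):

hyp invoke hyp-custom-endpoint \
    --endpoint-name tinyllama-1-1b-endpoint \
    --body '{"inputs": "What is the capital of France?"}'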
You will get an output similar to the following:
Alternatively, you can deploy using the SageMaker HyperPod Python SDK. The following code will yield the equivalent deployment to the preceding CLI example:
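The following is a minimal sketch; the module paths, class names, and resource values are based on the SDK reference documentation and may differ between SDK versions, and the bucket, image URI, and endpoint names are placeholders:

from sagemaker.hyperpod.inference.config.hp_endpoint_config import (
    ModelInvocationPort,
    ModelSourceConfig,
    ModelVolumeMount,
    Resources,
    S3Storage,
    TlsConfig,
    Worker,
)
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Location of the model artifacts uploaded earlier (bucket, Region, and prefix are placeholders)
model_source_config = ModelSourceConfig(
    model_source_type="s3",
    model_location="models/tinyllama-1.1b/",
    s3_storage=S3Storage(
        bucket_name="<your-model-bucket>",
        region="<your-region>",
    ),
)

# Inference worker: container image, invocation port, model volume mount, and resources
worker = Worker(
    image="<djl-lmi-container-image-uri>",
    model_invocation_port=ModelInvocationPort(container_port=8080),
    model_volume_mount=ModelVolumeMount(name="model-weights"),
    resources=Resources(
        requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": 1},
        limits={"nvidia.com/gpu": 1},
    ),
)

tls_config = TlsConfig(tls_certificate_output_s3_uri="s3://<your-tls-bucket>/certificates/")

custom_endpoint = HPEndpoint(
    endpoint_name="tinyllama-1-1b-endpoint",
    instance_type="ml.g5.8xlarge",
    model_name="tinyllama-1-1b",
    model_source_config=model_source_config,
    worker=worker,
    tls_config=tls_config,
    namespace="default",
)

# Deploy to the cluster; progress can be observed with hyp list hyp-custom-endpoint
custom_endpoint.create()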
Debugging inference deployments
In addition to the monitoring of the inference pod logs, there are several other useful ways of debugging inference deployments:
You can access the HyperPod inference operator controller logs through the SageMaker HyperPod CLI. Run hyp get-operator-logs --since-hours 0.5 for the corresponding endpoint type (hyp-custom-endpoint or hyp-jumpstart-endpoint) to access the operator logs for custom and SageMaker JumpStart deployments, respectively.
You can view a list of inference deployments by running hyp list hyp-custom-endpoint or hyp list hyp-jumpstart-endpoint.
You can view the status and corresponding events of deployments by running hyp describe hyp-custom-endpoint --name <endpoint-name> or hyp describe hyp-jumpstart-endpoint --name <endpoint-name> for custom and SageMaker JumpStart deployments, respectively.
If the HyperPod observability stack is deployed to the cluster, run hyp get-monitoring --grafana and hyp get-monitoring --prometheus to get the Grafana dashboard and Prometheus workspace URLs, respectively, to view inference metrics as well.
To monitor GPU utilization or view directory contents, it can be useful to execute commands or open an interactive shell in the pods. For example, you can run a command in a pod with kubectl exec -it <pod-name> -- nvtop to use nvtop for visibility into GPU utilization, or open an interactive shell with kubectl exec -it <pod-name> -- /bin/bash.
For more information on the inference deployment features in SageMaker HyperPod, see Amazon SageMaker HyperPod launches model deployments to accelerate the generative AI model development lifecycle and Deploying models on Amazon SageMaker HyperPod.
Clean up
To delete the training job from the corresponding example, use the following CLI command:
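# Delete the FSDP training job created earlier
hyp delete hyp-pytorch-job --job-name fsdp-llama3-1-8b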
To delete the model deployments from the inference example, use the following CLI commands for SageMaker JumpStart and custom model deployments, respectively:
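# Delete the SageMaker JumpStart model deployment (endpoint name as chosen earlier)
hyp delete hyp-jumpstart-endpoint --name deepseek-qwen-1-5b-endpoint

# Delete the custom model deployment
hyp delete hyp-custom-endpoint --name tinyllama-1-1b-endpoint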
To avoid incurring ongoing costs for the instances running in your cluster, you can scale down the instances or delete them.
Conclusion
The new SageMaker HyperPod CLI and SDK can significantly streamline the process of training and deploying large-scale AI models. Through the examples in this post, we’ve demonstrated how these tools provide the following benefits:
Simplified workflows – The CLI offers straightforward commands for common tasks like distributed training and model deployment, making powerful capabilities of SageMaker HyperPod accessible to data scientists without requiring deep infrastructure knowledge.
Flexible development options – Although the CLI handles common scenarios, the SDK enables fine-grained control and customization for more complex requirements, so developers can programmatically configure every aspect of their distributed ML workloads.
Comprehensive observability – Both interfaces provide robust monitoring and debugging capabilities through system logs and integration with the SageMaker HyperPod observability stack, helping quickly identify and resolve issues during development.
Production-ready deployment – The tools support end-to-end workflows from experimentation to production, including features like automatic TLS certificate generation for secure model endpoints and integration with SageMaker inference endpoints.
Getting started with these tools is as simple as installing the sagemaker-hyperpod package. The SageMaker HyperPod CLI and SDK provide the right level of abstraction for both data scientists looking to quickly experiment with distributed training and ML engineers building production systems.
For more information about SageMaker HyperPod and these development tools, refer to the SageMaker HyperPod CLI and SDK documentation or explore the example notebooks.
About the authors
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Shweta Singh is a Senior Product Manager in the Amazon SageMaker Machine Learning platform team at AWS, leading the SageMaker Python SDK. She has worked in several product roles in Amazon for over 5 years. She has a Bachelor of Science degree in Computer Engineering and a Masters of Science in Financial Engineering, both from New York University.
Nicolas Jourdan is a Specialist Solutions Architect at AWS, where he helps customers unlock the full potential of AI and ML in the cloud. He holds a PhD in Engineering from TU Darmstadt in Germany, where his research focused on the reliability, concept drift detection, and MLOps of industrial ML applications. Nicolas has extensive hands-on experience across industries, including autonomous driving, drones, and manufacturing, having worked in roles ranging from research scientist to engineering manager. He has contributed to award-winning research, holds patents in object detection and anomaly detection, and is passionate about applying cutting-edge AI to solve complex real-world problems.