Fine-tune OpenAI GPT-OSS models using Amazon SageMaker HyperPod recipes

By Advanced AI Editor | August 21, 2025


This post is the second part of the GPT-OSS series focusing on model customization with Amazon SageMaker AI. In Part 1, we demonstrated fine-tuning GPT-OSS models using open source Hugging Face libraries with SageMaker training jobs, which support distributed multi-GPU and multi-node configurations, so you can spin up high-performance clusters on demand.

In this post, we show how you can fine-tune GPT-OSS models using recipes on SageMaker HyperPod and training jobs. SageMaker HyperPod recipes help you get started with training and fine-tuning popular publicly available foundation models (FMs) such as Meta's Llama, Mistral, and DeepSeek in just minutes, using either SageMaker HyperPod or training jobs. The recipes provide pre-built, validated configurations that reduce the complexity of setting up distributed training environments while maintaining enterprise-grade performance and scalability. We outline the steps to fine-tune the GPT-OSS model on a multilingual reasoning dataset, HuggingFaceH4/Multilingual-Thinking, so GPT-OSS can handle structured chain-of-thought (CoT) reasoning across multiple languages.

Solution overview

This solution uses SageMaker HyperPod recipes to run a fine-tuning job on HyperPod using Amazon Elastic Kubernetes Service (Amazon EKS) orchestration or training jobs. Recipes are processed through the SageMaker HyperPod recipe launcher, which serves as the orchestration layer responsible for launching a job on the corresponding architecture such as SageMaker HyperPod (Slurm or Amazon EKS) or training jobs. To learn more, see SageMaker HyperPod recipes.

For details on fine-tuning the GPT-OSS model, see Fine-tune OpenAI GPT-OSS models on Amazon SageMaker AI using Hugging Face libraries.

In the following sections, we discuss the prerequisites for both options, and then move on to the data preparation. The prepared data is saved to Amazon FSx for Lustre, which is used as the persistent file system for SageMaker HyperPod, or Amazon Simple Storage Service (Amazon S3) for training jobs. We then use recipes to submit the fine-tuning job, and finally deploy the trained model to a SageMaker endpoint for testing and evaluating the model. The following diagram illustrates this architecture.

Prerequisites

To follow along, you must have the following prerequisites:

A local development environment with AWS credentials configured for creating and accessing SageMaker resources, or a remote environment such as Amazon SageMaker Studio.
For SageMaker HyperPod fine-tuning, you need a SageMaker HyperPod cluster with Amazon EKS orchestration that is in the InService state.

For fine-tuning the model using SageMaker training jobs, you must have one ml.p5.48xlarge instance (with 8 NVIDIA H100 GPUs) available for training job usage. If you don't have sufficient limits, request the following SageMaker quota on the Service Quotas console: ml.p5.48xlarge for training job usage (or ml.p5.48xlarge for cluster usage, if fine-tuning on HyperPod): 1.

It might take up to 24 hours for these limits to be approved. You can also use SageMaker training plans to reserve these instances for a specific timeframe and use case (cluster or training jobs usage). For more details, see Reserve training plans for your training jobs or HyperPod clusters.
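
If you want to check your current quota programmatically before requesting an increase, the following is a minimal sketch using the boto3 Service Quotas client (filtering by instance name is an assumption; confirm the exact quota names shown in your account):

import boto3

# List SageMaker service quotas and print any that mention ml.p5.48xlarge
quotas = boto3.client("service-quotas")
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if "ml.p5.48xlarge" in quota["QuotaName"]:
            print(f'{quota["QuotaName"]}: {quota["Value"]}')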

Next, use your preferred development environment to prepare the dataset for fine-tuning. You can find the full code in the Generative AI using Amazon SageMaker repository on GitHub.

Data tokenization

We use the HuggingFaceH4/Multilingual-Thinking dataset, which is a multilingual reasoning dataset containing CoT examples translated into languages such as French, Spanish, and German. The recipe supports a sequence length of 4,096 tokens for the GPT-OSS 120B model. The following example code demonstrates how to tokenize the Multilingual-Thinking dataset. The recipe accepts data in Hugging Face (Arrow) format. After it's tokenized, you can save the processed dataset to disk.

import os

import boto3
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")
messages = dataset[0]["messages"]
conversation = tokenizer.apply_chat_template(messages, tokenize=False)
print(conversation)

def preprocess_function(example):
    return tokenizer.apply_chat_template(example["messages"],
                                         return_dict=True,
                                         padding="max_length",
                                         max_length=4096,
                                         truncation=True)

def label(x):
    # Copy input_ids to labels and mask padding tokens so they are ignored by the loss
    x["labels"] = np.array(x["input_ids"])
    x["labels"][x["labels"] == tokenizer.pad_token_id] = -100
    x["labels"] = x["labels"].tolist()
    return x

dataset = dataset.map(preprocess_function,
                      remove_columns=["reasoning_language",
                                      "developer",
                                      "user",
                                      "analysis",
                                      "final",
                                      "messages"])
dataset = dataset.map(label)

# for HyperPod, save to mounted FSx volume
dataset.save_to_disk("/fsx/multilingual_4096")

# for training jobs, save locally and upload to S3
dataset.save_to_disk("multilingual_4096")

def upload_directory(local_dir, bucket_name, s3_prefix=""):
    s3_client = boto3.client("s3")

    for root, dirs, files in os.walk(local_dir):
        for file in files:
            local_path = os.path.join(root, file)
            # Calculate relative path for S3
            relative_path = os.path.relpath(local_path, local_dir)
            s3_path = os.path.join(s3_prefix, relative_path).replace("\\", "/")

            print(f"Uploading {local_path} to {s3_path}")
            s3_client.upload_file(local_path, bucket_name, s3_path)

upload_directory("./multilingual_4096/", bucket_name, "multilingual_4096")  # bucket_name: your S3 bucket

Now that you have prepared and tokenized the dataset, you can fine-tune the GPT-OSS model on your dataset, using either SageMaker HyperPod or training jobs. SageMaker training jobs are ideal for one-off or periodic training workloads that need temporary compute resources, making it a fully managed, on-demand experience for your training needs. SageMaker HyperPod is optimal for continuous development and experimentation, providing a persistent, preconfigured, and failure-resilient cluster. Depending on your choice, skip to the appropriate section for next steps.

Fine-tune the model using SageMaker HyperPod

To fine-tune the model using HyperPod, start by setting up the virtual environment and installing the necessary dependencies to execute the training job on the EKS cluster. Make sure the cluster is InService before proceeding, and you’re using Python 3.9 or greater in your development environment.
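
To verify that the cluster is InService, a minimal boto3 sketch (the cluster name is a placeholder for your own HyperPod cluster):

import boto3

sm = boto3.client("sagemaker")
# Placeholder cluster name; replace with your HyperPod cluster
cluster = sm.describe_cluster(ClusterName="my-hyperpod-cluster")
print(cluster["ClusterStatus"])  # should print "InService" before you proceed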

python3 -m venv ${PWD}/venv
source venv/bin/activate

Next, download and set up the SageMaker HyperPod recipes repository:

git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt

You can now use the SageMaker HyperPod recipe launch scripts to submit your training job. Using the recipe involves updating the k8s.yaml configuration file and executing the launch script.

In recipes_collection/cluster/k8s.yaml, update the persistent_volume_claims section. It mounts the FSx claim to the /fsx directory of each computing pod:

- claimName: fsx-claim
  mountPath: fsx

SageMaker HyperPod recipes provide a launch script for each recipe within the launcher_scripts directory. To fine-tune the GPT-OSS 120B model, update the launch script located at launcher_scripts/gpt_oss/run_hf_gpt_oss_120b_seq4k_gpu_lora.sh and set the cluster_type parameter.

The updated launch script should look similar to the following code when running SageMaker HyperPod with Amazon EKS. Make sure that cluster=k8s and cluster_type=k8s are updated in the launch script:

#!/bin/bash

# Original Copyright (c), NVIDIA CORPORATION. Modifications © Amazon.com

# Users should set up their cluster type in /recipes_collection/config.yaml

SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

HF_MODEL_NAME_OR_PATH="openai/gpt-oss-120b" # HuggingFace pretrained model name or path

TRAIN_DIR="/fsx/multilingual_4096" # Location of training dataset
VAL_DIR="/fsx/multilingual_4096" # Location of validation dataset

EXP_DIR="/fsx/experiment" # Location to save experiment info including logging, checkpoints, etc.
HF_ACCESS_TOKEN="hf_xxxxxxxx" # Optional HuggingFace access token

# Important: include the cluster and cluster_type lines when running on HyperPod with Amazon EKS
HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=fine-tuning/gpt_oss/hf_gpt_oss_120b_seq4k_gpu_lora \
    container="658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:sm-pytorch_gpt_oss_patch_pt-2.7_cuda12.8" \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-gpt-oss-120b-lora" \
    cluster=k8s \
    cluster_type=k8s \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.trainer.num_nodes=1 \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    recipes.model.hf_access_token="$HF_ACCESS_TOKEN"

When the script is ready, you can launch fine-tuning of the GPT-OSS 120B model using the following code:

chmod +x launcher_scripts/gpt_oss/run_hf_gpt_oss_120b_seq4k_gpu_lora.sh
bash launcher_scripts/gpt_oss/run_hf_gpt_oss_120b_seq4k_gpu_lora.sh

After submitting a job for fine-tuning, you can use the following command to verify successful submission. You should be able to see the pods running in your cluster:

kubectl get pods
NAME                                READY  STATUS   RESTARTS   AGE
hf-gpt-oss-120b-lora-h2cwd-worker-0 1/1    Running  0          14m

To check logs for the job, you can use the kubectl logs command:

kubectl logs -f hf-gpt-oss-120b-lora-h2cwd-worker-0

You should be able to see the following logs when the training begins and completes. You will find the checkpoints written to the /fsx/experiment/checkpoints folder.

warnings.warn(
    
Epoch 0:  40%|████      | 50/125 [08:47<13:10,  0.09it/s, Loss/train=0.254, Norms/grad_norm=0.128, LR/learning_rate=2.2e-6] [NeMo I 2025-08-18 17:49:48 nemo_logging:381] save SageMakerCheckpointType.PEFT_FULL checkpoint: /fsx/experiment/checkpoints/peft_full/steps_50
[NeMo I 2025-08-18 17:49:48 nemo_logging:381] Saving PEFT checkpoint to /fsx/experiment/checkpoints/peft_full/steps_50
[NeMo I 2025-08-18 17:49:49 nemo_logging:381] Loading Base model from : openai/gpt-oss-120b
You are attempting to use Flash Attention 2 without specifying a torch dtype. This might lead to unexpected behaviour
Loading checkpoint shards: 100%|██████████| 15/15 [01:49<00:00,  7.33s/it]
[NeMo I 2025-08-18 17:51:39 nemo_logging:381] Merging the adapter, this might take a while……
Unloading and merging model: 100%|██████████| 547/547 [00:07<00:00, 71.27it/s]
[NeMo I 2025-08-18 17:51:47 nemo_logging:381] Checkpointing to /fsx/experiment/checkpoints/peft_full/steps_50/final-model……
[NeMo I 2025-08-18 18:00:14 nemo_logging:381] Successfully save the merged model checkpoint.
`Trainer.fit` stopped: `max_steps=50` reached.
Epoch 0:  40%|████      | 50/125 [23:09<34:43,  0.04it/s, Loss/train=0.264, Norms/grad_norm=0.137, LR/learning_rate=2e-6]  

When the training is complete, the final merged model can be found in the experiment directory path you defined in the launcher script under /fsx/experiment/checkpoints/peft_full/steps_50/final-model.
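
If you want to host the merged model from Amazon S3 (as shown in the inference section later in this post), you can copy it from the FSx mount to your bucket. The following is a minimal sketch reusing the upload_directory helper defined during data preparation (the bucket name and prefix are placeholders):

# Upload the merged checkpoint from FSx to S3 so it can be deployed to a SageMaker endpoint
upload_directory("/fsx/experiment/checkpoints/peft_full/steps_50/final-model",
                 bucket_name,                # placeholder: your S3 bucket
                 "gpt-oss/model/artifacts")  # placeholder: prefix referenced at deployment time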

Fine-tune using SageMaker training jobs

You can also use recipes directly with SageMaker training jobs using the SageMaker Python SDK. The training jobs automatically spin up the compute, load the input data, run the training script, save the model to your output location, and tear down the instances, for a smooth training experience.

The following code snippet shows how to use recipes with the PyTorch estimator. You can use the training_recipe parameter to specify the training or fine-tuning recipe to be used, and recipe_overrides for any parameters that need replacement. For training jobs, update the input, output, and results directories to locations in /opt/ml as required by SageMaker training jobs.

import os

import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import FileSystemInput

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
output = os.path.join(f"s3://{bucket}", "output")

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
    "use_smp_model": "False",
}

# create the estimator object
estimator = PyTorch(
    output_path=output,
    base_job_name="gpt-oss-recipe",
    role=role,
    instance_type="ml.p5.48xlarge",
    training_recipe="fine-tuning/gpt_oss/hf_gpt_oss_120b_seq4k_gpu_lora",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    image_uri="658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:sm-pytorch_gpt_oss_patch_pt-2.7_cuda12.8",
)

# submit the training job
estimator.fit(
    inputs={
        "train": f"s3://{bucket}/datasets/multilingual_4096/",
        "val": f"s3://{bucket}/datasets/multilingual_4096/",
    },
    wait=True,
)

After the job is submitted, you can monitor the status of your training job on the SageMaker console, by choosing Training jobs under Training in the navigation pane. Choose the training job that starts with gpt-oss-recipe to view its details and logs. When the training job is complete, the outputs will be saved to an S3 location. You can get the location of the output artifacts from the S3 model artifact section on the job details page.
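
You can also poll the job status programmatically instead of using the console. The following is a minimal sketch with boto3 that looks up the most recent job created from this recipe and prints its status and output location:

import boto3

sm = boto3.client("sagemaker")
# Find the most recent training job whose name starts with the base_job_name used above
jobs = sm.list_training_jobs(NameContains="gpt-oss-recipe", SortBy="CreationTime",
                             SortOrder="Descending", MaxResults=1)
job_name = jobs["TrainingJobSummaries"][0]["TrainingJobName"]
desc = sm.describe_training_job(TrainingJobName=job_name)
print(desc["TrainingJobStatus"], desc.get("SecondaryStatus"))
print(desc.get("ModelArtifacts", {}).get("S3ModelArtifacts"))  # S3 location of the output artifacts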

Run inference

After you fine-tune your GPT-OSS model with SageMaker recipes on either SageMaker training jobs or SageMaker HyperPod, the output is a customized model artifact that merges the base model with the customized PEFT adapters. This final model is stored in Amazon S3 and can be deployed directly from Amazon S3 to SageMaker endpoints for real-time inference.

To serve GPT-OSS models, you must have the latest vLLM containers (v0.10.1 or later). A full list of vllm-openai Docker image versions is available on Docker Hub.

The steps to deploy your fine-tuned GPT-OSS model are outlined in this section.

Build the latest GPT-OSS container for your SageMaker endpoint

If you're deploying the model from SageMaker Studio using JupyterLab or the Code Editor, both environments come with Docker preinstalled. Make sure that you're using the SageMaker Distribution image v3.0 or later for compatibility. You can build your deployment container by running the following commands:

%%bash # <- use this if you're running this inside a JupyterLab cell

# navigate to the deploy dir from the current workdir, to build the container
cd ./deploy

# build and push the container
chmod +x build.sh
bash build.sh

cd ..

If you’re running these commands from a local terminal or other environment, simply omit the %%bash line and run the commands as standard shell commands.

The build.sh script is responsible for automatically building and pushing a vllm-openai container that is optimized for SageMaker endpoints. After it’s built, the custom SageMaker endpoint compatible vllm image is pushed to Amazon Elastic Container Registry (Amazon ECR). SageMaker endpoints can then pull this image from Amazon ECR at runtime to spin up the container for inference.

The following is an example of the build.sh script:

export REGION={region}
export ACCOUNT_ID={account_id}
export REPOSITORY_NAME=vllm
export TAG=v0.10.1

full_name="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPOSITORY_NAME}:${TAG}"

echo "building $full_name"

DOCKER_BUILDKIT=0 docker build . --network sagemaker --tag $full_name --file Dockerfile

aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --region ${REGION} --repository-names "${REPOSITORY_NAME}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --region ${REGION} --repository-name "${REPOSITORY_NAME}" > /dev/null
fi

docker tag $REPOSITORY_NAME:$TAG ${full_name}
docker push ${full_name}

The Dockerfile defines how we convert an open source vLLM Docker image into a SageMaker hosting-compatible image. This involves extending the base vllm-openai image, adding the serve entrypoint script, and making it executable. See the following example Dockerfile:

FROM vllm/vllm-openai:v0.10.1

COPY serve /usr/bin/serve
RUN chmod 777 /usr/bin/serve

ENTRYPOINT [ "/usr/bin/serve" ]

The serve script acts as a translation layer between SageMaker hosting conventions and the vLLM runtime. You can maintain the same deployment workflow you’re familiar with when hosting models on SageMaker endpoints, while automatically converting SageMaker-specific configurations into the format expected by vLLM.

Key points to note about this script:

It enforces the use of port 8080, which SageMaker requires for inference containers
It dynamically translates environment variables prefixed with OPTION_ into CLI arguments for vLLM (for example, OPTION_MAX_MODEL_LEN=4096 becomes --max-model-len 4096)
It prints the final set of arguments for visibility
It finally launches the vLLM API server with the translated arguments

The following is an example serve script:

#!/bin/bash

# Define the prefix for environment variables to look for
PREFIX="OPTION_"
ARG_PREFIX="--"

# Initialize an array for storing the arguments
# port 8080 required by sagemaker, https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-container-response
ARGS=(--port 8080)

# Loop through all environment variables
while IFS='=' read -r key value; do
    # Remove the prefix from the key, convert to lowercase, and replace underscores with dashes
    arg_name=$(echo "${key#"${PREFIX}"}" | tr '[:upper:]' '[:lower:]' | tr '_' '-')

    # Add the argument name and value to the ARGS array
    ARGS+=("${ARG_PREFIX}${arg_name}")
    if [ -n "$value" ]; then
        ARGS+=("$value")
    fi
done < <(env | grep "^${PREFIX}")

echo "-------------------------------------------------------------------"
echo "vLLM engine args: [${ARGS[@]}]"
echo "-------------------------------------------------------------------"

# Pass the collected arguments to the main entrypoint
exec python3 -m vllm.entrypoints.openai.api_server "${ARGS[@]}"

Host customized GPT-OSS as a SageMaker real-time endpoint

Now you can deploy your fine-tuned GPT-OSS model using the ECR image URI you built in the previous step. In this example, the model artifacts are stored securely in an S3 bucket, and SageMaker will download them into the container at runtime. Complete the following configurations:

Set model_data to point to the S3 prefix where your model artifacts are located
Set the OPTION_MODEL environment variable to /opt/ml/model, which is where SageMaker mounts the model inside the container
(Optional) If you're serving a model from the Hugging Face Hub instead of Amazon S3, you can set OPTION_MODEL directly to the Hugging Face model ID

The endpoint startup might take several minutes as the model artifacts are downloaded and the container is initialized. The following is example deployment code:

inference_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/vllm:v0.10.1"

...
...

lmi_model = sagemaker.Model(
    image_uri=inference_image,
    env={
        "OPTION_MODEL": "/opt/ml/model",  # set this to let the SageMaker endpoint read a model stored in S3, else set it to a HF model ID
        "OPTION_SERVED_MODEL_NAME": "model",
        "OPTION_TENSOR_PARALLEL_SIZE": json.dumps(num_gpus),
        "OPTION_DTYPE": "bfloat16",
        # "VLLM_ATTENTION_BACKEND": "TRITON_ATTN_VLLM_V1",  # not required for vLLM 0.10.1 and above
        "OPTION_ASYNC_SCHEDULING": "true",
        "OPTION_QUANTIZATION": "mxfp4"
    },
    role=role,
    name=model_name,
    model_data={
        "S3DataSource": {
            "S3Uri": "s3://path/to/gpt-oss/model/artifacts",
            "S3DataType": "S3Prefix",
            "CompressionType": "None"
        }
    },
)

...

lmi_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=600,
    endpoint_name=endpoint_name,
    endpoint_type=sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=inference_component_name,
    resources=ResourceRequirements(requests={"num_accelerators": 1, "memory": 1024 * 3, "copies": 1}),
)

Sample inference

After your endpoint is deployed and in the InService state, you can invoke your fine-tuned GPT-OSS model using the SageMaker Python SDK.

The following is an example predictor setup:

pretrained_predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker.Session(boto3.Session(region_name=boto3.Session().region_name)),
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
    component_name=inference_component_name
)

The modified vLLM container is fully compatible with the OpenAI-style messages input format, making it straightforward to send chat-style requests:

payload = {
    "messages": [{"role": "user", "content": "Hello who are you?"}],
    "parameters": {"max_new_tokens": 64, "temperature": 0.2}
}

output = pretrained_predictor.predict(payload)
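
The predict call returns the container's JSON response. Assuming the vLLM server replies with an OpenAI-style chat completion (an assumption; inspect the raw output if the keys differ), you can extract the generated text as follows:

print(output)  # inspect the full response first

# If the response follows the OpenAI chat-completion schema, the generated text is here:
generated_text = output["choices"][0]["message"]["content"]
print(generated_text)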

You have successfully deployed and invoked your custom fine-tuned GPT-OSS model on SageMaker real-time endpoints, using the vLLM framework for optimized, low-latency inference. You can find more GPT-OSS hosting examples in the OpenAI gpt-oss examples GitHub repo.

Clean up

To avoid incurring additional charges, complete the following steps to clean up the resources used in this post:

Delete the SageMaker endpoint:

pretrained_predictor.delete_endpoint()

If you created a SageMaker HyperPod cluster for the purposes of this post, delete the cluster by following the instructions in Deleting a SageMaker HyperPod cluster.
Clean up the FSx for Lustre volume if it’s no longer needed by following instructions in Deleting a file system.
If you used training jobs, the training instances are automatically deleted when the jobs are complete.

Conclusion

In this post, we showed how to fine-tune OpenAI’s GPT-OSS models (gpt-oss-120b and gpt-oss-20b) on SageMaker AI using SageMaker HyperPod recipes. We discussed how SageMaker HyperPod recipes provide a powerful yet accessible solution for organizations to scale their AI model training capabilities with large language models (LLMs) including GPT-OSS, using either a persistent cluster through SageMaker HyperPod, or an ephemeral cluster using SageMaker training jobs. The architecture streamlines complex distributed training workflows through its intuitive recipe-based approach, reducing setup time from weeks to minutes. We also showed how these fine-tuned models can be seamlessly deployed to production using SageMaker endpoints with vLLM optimization, providing enterprise-grade inference capabilities with OpenAI-compatible APIs. This end-to-end workflow, from training to deployment, helps organizations build and serve custom LLM solutions while using the scalable infrastructure of AWS and comprehensive ML platform capabilities of SageMaker.

To begin using the SageMaker HyperPod recipes, visit the Amazon SageMaker HyperPod recipes GitHub repo for comprehensive documentation and example implementations. If you’re interested in exploring the fine-tuning further, the Generative AI using Amazon SageMaker GitHub repo has the necessary code and notebooks. Our team continues to expand the recipe ecosystem based on customer feedback and emerging ML trends, making sure that you have the tools needed for successful AI model training.

Special thanks to everyone who contributed to the launch: Hengzhi Pei, Zach Kimberg, Andrew Tian, Leonard Lausen, Sanjay Dorairaj, Manish Agarwal, Sareeta Panda, Chang Ning Tsai, Maxwell Nuyens, Natasha Sivananjaiah, and Kanwaljit Khurmi.

About the authors

Durga Sury is a Senior Solutions Architect at Amazon SageMaker, where she helps enterprise customers build secure and scalable AI/ML systems. When she's not architecting solutions, you can find her enjoying sunny walks with her dog, immersing herself in murder mystery books, or catching up on her favorite Netflix shows.

Pranav Murthy is a Senior Generative AI Data Scientist at AWS, specializing in helping organizations innovate with Generative AI, Deep Learning, and Machine Learning on Amazon SageMaker AI. Over the past 10+ years, he has developed and scaled advanced computer vision (CV) and natural language processing (NLP) models to tackle high-impact problems—from optimizing global supply chains to enabling real-time video analytics and multilingual search. When he’s not building AI solutions, Pranav enjoys playing strategic games like chess, traveling to discover new cultures, and mentoring aspiring AI practitioners. You can find Pranav on LinkedIn.

Sumedha Swamy is a Senior Manager of Product Management at Amazon Web Services (AWS), where he leads several areas of the Amazon SageMaker, including SageMaker Studio – the industry-leading integrated development environment for machine learning, developer and administrator experiences, AI infrastructure, and SageMaker SDK.

Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Anirudh Viswanathan is a Senior Product Manager, Technical, at AWS with the SageMaker team, where he focuses on Machine Learning. He holds a Master’s in Robotics from Carnegie Mellon University and an MBA from the Wharton School of Business. Anirudh is a named inventor on more than 50 AI/ML patents. He enjoys long-distance running, exploring art galleries, and attending Broadway shows.


