Use AWS Deep Learning Containers with Amazon SageMaker AI managed MLflow

By Advanced AI Editor | September 18, 2025


Organizations building custom machine learning (ML) models often have specialized requirements that standard platforms can’t accommodate. For example, healthcare companies need specific environments to protect patient data while meeting HIPAA compliance, financial institutions require specific hardware configurations to optimize proprietary trading algorithms, and research teams need flexibility to experiment with cutting-edge techniques using custom frameworks. These specialized needs drive organizations to build custom training environments that give them control over hardware selection, software versions, and security configurations.

These custom environments provide the necessary flexibility, but they create significant challenges for ML lifecycle management. Organizations typically try to solve these problems by building additional custom tools, and some teams piece together various open source solutions. These approaches further increase operational costs and require engineering resources that could be better used elsewhere.

AWS Deep Learning Containers (DLCs) and managed MLflow on Amazon SageMaker AI offer a powerful solution that addresses both needs. DLCs provide preconfigured Docker containers with frameworks like TensorFlow and PyTorch, including NVIDIA CUDA drivers for GPU support. DLCs are optimized for performance on AWS, regularly maintained to include the latest framework versions and patches, and designed to integrate seamlessly with AWS services for training and inference. AWS Deep Learning AMIs (DLAMIs) are preconfigured Amazon Machine Images (AMIs) for Amazon Elastic Compute Cloud (Amazon EC2) instances. DLAMIs come with popular deep learning frameworks like PyTorch and TensorFlow, and are available for CPU-based instances and high-powered GPU-accelerated instances. They include NVIDIA CUDA, cuDNN, and other necessary tools, with AWS managing the updates of DLAMIs. Together, DLAMIs and DLCs provide ML practitioners with the infrastructure and tools to accelerate deep learning in the cloud at scale.

SageMaker managed MLflow delivers comprehensive lifecycle management with one-line automatic logging, enhanced comparison capabilities, and complete lineage tracking. As a fully managed service on SageMaker AI, it alleviates the operational burden of maintaining tracking infrastructure.
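
As a concrete illustration of that one-line automatic logging, the following minimal sketch enables autologging against a managed tracking server. It assumes the sagemaker-mlflow plugin is installed in your environment, and the tracking server ARN is a placeholder.

import mlflow

# Placeholder ARN for your SageMaker managed MLflow tracking server
mlflow.set_tracking_uri("arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>")

# One line: subsequent TensorFlow/Keras training calls log parameters,
# metrics, and model artifacts to the tracking server automatically
mlflow.autolog()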

In this post, we show how to integrate AWS DLCs with MLflow to create a solution that balances infrastructure control with robust ML governance. We walk through a functional setup that your team can use to meet your specialized requirements while significantly reducing the time and resources needed for ML lifecycle management.

Solution overview

In this section, we describe the architecture and AWS services used to integrate AWS DLCs with SageMaker managed MLflow. The solution uses several AWS services together to create a scalable environment for ML development:

  • AWS DLCs provide preconfigured Docker images with optimized ML frameworks
  • SageMaker managed MLflow introduces enhanced model registry capabilities with fine-grained access controls and adds generative AI support through specialized tracking for LLM experiments and prompt management
  • Amazon Elastic Container Registry (Amazon ECR) stores and manages container images
  • Amazon Simple Storage Service (Amazon S3) stores input and output artifacts
  • Amazon EC2 runs the AWS DLCs

For this use case, you will develop a TensorFlow neural network model for abalone age prediction with integrated SageMaker managed MLflow tracking code. Next, you will pull an optimized TensorFlow training container from the AWS public ECR repository and configure an EC2 instance with access to the MLflow tracking server. You will then execute the training process within the DLC while storing model artifacts in Amazon S3 and logging experiment results to MLflow. Finally, you will view and compare experiment results in the MLflow UI to evaluate model performance.

The following diagram shows the interaction between the various AWS services, AWS DLCs, and SageMaker managed MLflow in this solution.

End-to-end ML process: TensorFlow script to ECR, EC2 training, MLflow logging, S3 storage

The workflow consists of the following steps:

  1. Develop a TensorFlow neural network model for abalone age prediction, and integrate SageMaker managed MLflow tracking within the model code to log parameters, metrics, and artifacts (a minimal sketch of this tracking code follows the list).
  2. Pull an optimized TensorFlow training container from the AWS public ECR repository, and configure an EC2 instance running a DLAMI with access to the MLflow tracking server using an AWS Identity and Access Management (IAM) role for EC2.
  3. Execute the training process within the DLC running on Amazon EC2, store model artifacts in Amazon S3, and log all experiment results and register the model in MLflow.
  4. Compare experiment results through the MLflow UI.
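
The sketch below illustrates steps 1 and 3. It is hedged rather than copied from the repository: it trains a small TensorFlow regression model on the abalone data and tracks the run with SageMaker managed MLflow. The tracking server ARN, the local abalone.csv path, and the column handling are placeholders and assumptions; the actual training script lives in the GitHub repository.

import mlflow
import mlflow.tensorflow
import pandas as pd
import tensorflow as tf

# Placeholder ARN for the SageMaker managed MLflow tracking server
mlflow.set_tracking_uri("arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>")
mlflow.set_experiment("abalone-tensorflow-experiment")

# Assumed local copy of the abalone dataset with a header row;
# numeric features only, "rings" as the regression target
data = pd.read_csv("abalone.csv")
X = data.drop(columns=["rings", "sex"]).values
y = data["rings"].values

model = tf.keras.Sequential([
    tf.keras.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

with mlflow.start_run():
    mlflow.log_params({"epochs": 50, "batch_size": 32, "optimizer": "adam"})
    history = model.fit(X, y, epochs=50, batch_size=32, validation_split=0.2)
    mlflow.log_metric("final_val_mae", history.history["val_mae"][-1])
    # Registering the model by name is what surfaces it in the model registry
    mlflow.tensorflow.log_model(
        model,
        artifact_path="model",
        registered_model_name="abalone-tensorflow-custom-callback-model",
    )

Run inside the TensorFlow DLC on the EC2 instance, a script along these lines produces runs like those shown in the screenshots later in this post.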

Prerequisites

To follow along with this walkthrough, make sure you have the following prerequisites:

  • An AWS account with billing enabled.
  • An EC2 instance (t3.large or larger) running Ubuntu 20.04 or later, with at least 20 GB of available disk space for Docker images and containers.
  • Docker (latest) installed on the EC2 instance.
  • The AWS Command Line Interface (AWS CLI) version 2.0 or later.
  • An IAM role with permissions for the following:
    • Amazon EC2 to talk to SageMaker managed MLflow.
    • Amazon ECR to pull the TensorFlow container.
    • SageMaker managed MLflow to track experiments and register models.
  • An Amazon SageMaker Studio domain. To create a domain, refer to Guide to getting set up with Amazon SageMaker AI. Add the sagemaker-mlflow:AccessUI permission to the SageMaker execution role that was created; this permission makes it possible to navigate to MLflow from the SageMaker Studio console.
  • A SageMaker managed MLflow tracking server set up in SageMaker AI (a sketch of one way to create one follows this list).
  • Internet access from the EC2 instance to download the abalone dataset.
  • The GitHub repository cloned to your EC2 instance.
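
For the tracking server prerequisite, one way to create it programmatically is sketched below with boto3. The server name, artifact store bucket, and role ARN are placeholders, and the parameters shown are an assumed minimal set; the SageMaker AI console offers the same options.

import boto3

sagemaker_client = boto3.client("sagemaker")

# Placeholder values; the role must allow SageMaker to read/write the artifact bucket
sagemaker_client.create_mlflow_tracking_server(
    TrackingServerName="dlc-mlflow-tracking-server",
    ArtifactStoreUri="s3://<your-bucket>/mlflow-artifacts",
    RoleArn="arn:aws:iam::<account-id>:role/<tracking-server-role>",
    TrackingServerSize="Small",
)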

Deploy the solution

Detailed step-by-step instructions are available in the accompanying GitHub repository’s README file. The walkthrough covers the entire workflow—from provisioning infrastructure and setting up permissions to executing your first training job with comprehensive experiment tracking.

Analyze experiment results

After you’ve implemented the solution following the steps in the README file, you can access and analyze your experiment results. The following screenshots demonstrate how SageMaker managed MLflow provides comprehensive experiment tracking, model governance, and auditability for your deep learning workloads. When training is complete, all experiment metrics, parameters, and artifacts are automatically captured in MLflow, providing a central location to track and compare your model development journey. The following screenshot shows the experiment abalone-tensorflow-experiment with a run named unique-cod-104. This dashboard gives you a complete overview of all experiment runs, so you can compare different approaches and model iterations at a glance.

MLflow dashboard with abalone-tensorflow experiment status, metrics, and dataset tracking
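
If you prefer to compare runs programmatically rather than in the UI, a hedged sketch using the standard MLflow client API follows; the metric column name assumes the metric logged in the earlier training sketch.

import mlflow

mlflow.set_tracking_uri("arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>")

# Returns a pandas DataFrame with one row per run in the experiment
runs = mlflow.search_runs(experiment_names=["abalone-tensorflow-experiment"])
print(runs[["run_id", "status", "metrics.final_val_mae"]].sort_values("metrics.final_val_mae"))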

The following screenshot shows the detailed information for run unique-cod-104, including the registered model abalone-tensorflow-custom-callback-model (version v2). This view provides critical information about model provenance, showing exactly which experiment run produced which model version, which is a key component of model governance.

Detailed MLflow run overview showing experiment metrics, neural network parameters, and performance statistics for abalone-tensorflow model

The following visualization tracks the training loss across epochs, captured using a custom callback. Such metrics help you understand model convergence patterns and evaluate training performance, giving insights into potential optimization opportunities.

Training loss graph showing convergence over time
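
The exact callback used in the repository isn't reproduced here, but the following sketch shows the general shape of a custom Keras callback that logs per-epoch loss to MLflow, which is what produces a curve like the one above; the metric names are illustrative.

import mlflow
import tensorflow as tf

class MLflowLossCallback(tf.keras.callbacks.Callback):
    """Log training and validation loss to MLflow at the end of each epoch."""

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if "loss" in logs:
            mlflow.log_metric("train_loss", logs["loss"], step=epoch)
        if "val_loss" in logs:
            mlflow.log_metric("val_loss", logs["val_loss"], step=epoch)

# Pass the callback to fit() inside an active MLflow run:
# model.fit(X, y, epochs=50, validation_split=0.2, callbacks=[MLflowLossCallback()])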

The registered models view illustrated in the following screenshot shows how abalone-tensorflow-custom-callback-model is tracked in the model registry. This integration enables versioning, model lifecycle management, and deployment tracking.

MLflow model registry interface showing version history, creation timestamps, and management options for abalone model

The following screenshot illustrates one of the solution’s most powerful governance features. When logging a model with mlflow.tensorflow.log_model() using the registered_model_name parameter, the model is automatically registered in the Amazon SageMaker Model Registry. This creates full traceability from experiment to deployed model, establishing an audit trail that connects your training runs directly to production models.

SageMaker Studio interface showing Abalone TensorFlow model version 2 details with MLflow integration
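
To consume a registered version downstream, a standard MLflow models:/ URI works against the same tracking server. The sketch below assumes the model name and version shown in the screenshots and uses a placeholder tracking server ARN.

import mlflow

mlflow.set_tracking_uri("arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>")

# Load version 2 of the registered model for inference
model = mlflow.pyfunc.load_model("models:/abalone-tensorflow-custom-callback-model/2")
# predictions = model.predict(features_dataframe)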

This seamless integration between your custom training environments and SageMaker governance tools helps you maintain visibility and compliance throughout your ML lifecycle.

The model artifacts are automatically uploaded to Amazon S3 after training completion, as illustrated in the following screenshot. This organized storage structure ensures that all model components, including weights, configurations, and associated metadata, are securely preserved and accessible through a standardized path.

MLflow model configuration details showing Python environment and TensorFlow specifications

Cost implications

The following resources incur costs. Refer to the respective pricing page to estimate costs.

  • Amazon EC2 On-Demand – Pricing depends on the instance size and AWS Region where the instance is provisioned. Storage using Amazon Elastic Block Store (Amazon EBS) is additional.
  • SageMaker managed MLflow – You can find the details under On-demand pricing for SageMaker MLflow for tracking and storage.
  • Amazon S3 – Refer to pricing for storage and requests.
  • SageMaker Studio – There are no additional charges for using the SageMaker Studio UI. However, any attached Amazon EFS or Amazon EBS volumes, and any jobs, resources, or JupyterLab applications launched from SageMaker Studio, will incur costs.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the MLflow tracking server, because it continues to incur costs as long as it's running:

aws sagemaker delete-mlflow-tracking-server \
    --tracking-server-name <tracking-server-name>

  2. Stop the EC2 instance to avoid incurring additional costs:

aws ec2 stop-instances --instance-ids <instance-id>

  3. Remove training data, model artifacts, and MLflow experiment data from your S3 buckets:

aws s3 rm s3://<bucket-name>/<prefix> --recursive

  4. Review and clean up any temporary IAM roles created for the EC2 instance.
  5. Delete your SageMaker Studio domain.

Conclusion

AWS DLCs and SageMaker managed MLflow provide ML teams with a solution that balances the trade-off between governance and flexibility. This integration helps data scientists seamlessly track experiments and deploy models for inference, and helps administrators establish secure, scalable SageMaker managed MLflow environments. Organizations can now standardize their ML workflows using either AWS DLCs or DLAMIs while accommodating specialized requirements, ultimately accelerating the journey from model experimentation to business impact with greater control and efficiency.

In this post, we explored how to integrate custom training environments with SageMaker managed MLflow to gain comprehensive experiment tracking and model governance. This approach maintains the flexibility of your preferred development environment while benefiting from centralized tracking, model registration, and lineage tracking. The integration provides a perfect balance between customization and standardization, so teams can innovate while maintaining governance best practices.

Now that you understand how to track training in DLCs with SageMaker managed MLflow, you can implement this solution in your own environment. All code examples and implementation details from this post are available in our GitHub repository.

About the authors

Gunjan Jain, an AWS Solutions Architect based in Southern California, specializes in guiding large financial services companies through their cloud transformation journeys. He expertly facilitates cloud adoption, optimization, and implementation of Well-Architected best practices. Gunjan’s professional focus extends to machine learning and cloud resilience, areas where he demonstrates particular enthusiasm. Outside of his professional commitments, he finds balance by spending time in nature.

Rahul Easwar is a Senior Product Manager at AWS, leading managed MLflow and Partner AI Apps within the SageMaker AIOps team. With over 15 years of experience spanning startups to enterprise technology, he leverages his entrepreneurial background and MBA from Chicago Booth to build scalable ML platforms that simplify AI adoption for organizations worldwide. Connect with Rahul on LinkedIn to learn more about his work in ML platforms and enterprise AI solutions.


