New Capabilities In Amazon SageMaker AI Continue To Transform How Organizations Develop AI Models

As AI models become increasingly sophisticated and specialized, the ability to quickly train and customize models can mean the difference between industry leadership and falling behind. That is why hundreds of thousands of customers use the fully managed infrastructure, tools, and workflows of Amazon SageMaker AI to scale and advance AI model development. Since launching in 2017, SageMaker AI has transformed how organizations approach AI model development by reducing complexity while maximizing performance. Since then, we’ve continued to relentlessly innovate, adding more than 420 new capabilities since launch to give customers the best tools to build, train, and deploy AI models quickly and efficiently. Today, we’re pleased to announce new innovations that build on the rich features of SageMaker AI to accelerate how customers build and train AI models.

Amazon SageMaker HyperPod: The infrastructure of choice for developing AI models

AWS launched Amazon SageMaker HyperPod in 2023 to reduce complexity and maximize performance and efficiency when building AI models. With SageMaker HyperPod, you can quickly scale generative AI model development across thousands of AI accelerators and reduce foundation model (FM) training and fine-tuning development costs by up to 40%. Many of today’s top models are trained on SageMaker HyperPod, including models from Hugging Face, Luma AI, Perplexity AI, Salesforce, Thomson Reuters, Writer, and Amazon. By training Amazon Nova FMs on SageMaker HyperPod, Amazon saved months of work and increased utilization of compute resources to more than 90%.

To further streamline workflows and make it faster to develop and deploy models, a new command line interface (CLI) and software development kit (SDK) provides a single, consistent interface that simplifies infrastructure management, unifies job submission across training and inference, and supports both recipe-based and custom workflows with integrated monitoring and control. Today, we are also adding two capabilities to SageMaker HyperPod that can help you reduce training costs and accelerate AI model development.

Reduce the time to troubleshoot performance issues from days to minutes with SageMaker HyperPod observability

To bring new AI innovations to market as quickly as possible, organizations need visibility across AI model development tasks and compute resources to optimize training efficiency and detect and resolve interruptions or performance bottlenecks as soon as possible. For example, to investigate if a training or fine-tuning job failure was the result of a hardware issue, data scientists and machine learning (ML) engineers want to quickly filter to review the monitoring data of the specific GPUs that performed the job rather than manually browsing through the hardware resources of an entire cluster to establish the correlation between the job failure and a hardware issue.

The new observability capability in SageMaker HyperPod transforms how you can monitor and optimize your model development workloads. Through a unified dashboard preconfigured in Amazon Managed Grafana, with the monitoring data automatically published to an Amazon Managed Service for Prometheus workspace, you can now see generative AI task performance metrics, resource utilization, and cluster health in a single view. Teams can now quickly spot bottlenecks, prevent costly delays, and optimize compute resources. You can define automated alerts, specify use case-specific task metrics and events, and publish them to the unified dashboard with just a few clicks.

By reducing troubleshooting time from days to minutes, this capability can help you accelerate your path to production and maximize the return on your AI investments.

DatologyAI builds tools to automatically select the best data on which to train deep learning models.

“We are excited to use Amazon SageMaker HyperPod’s one-click observability solution. Our senior staff members needed insights into how we’re utilizing GPU resources. The pre-built Grafana dashboards will give us exactly what we needed, with immediate visibility into critical metrics—from task-specific GPU utilization to file system (FSx for Lustre) performance—without requiring us to maintain any monitoring infrastructure. As someone who appreciates the power of the Prometheus Query Language, I like the fact that I can write my own queries and analyze custom metrics without worrying about infrastructure problems.”
–Josh Wills, Member of Technical Staff at DatologyAI

–

Articul8 helps companies build sophisticated enterprise generative AI applications.

“With SageMaker HyperPod observability, we can now deploy our metric collection and visualization systems in a single click, saving our teams days of otherwise manual setup and enhancing our cluster observability workflows and insights. Our data scientists can quickly monitor task performance metrics, such as latency, and identify hardware issues without manual configuration. SageMaker HyperPod observability will help streamline our foundation model development processes, allowing us to focus on advancing our mission of delivering accessible and reliable AI-powered innovation to our customers.”
–Renato Nascimento, head of technology at Articul8

–

Deploy Amazon SageMaker JumpStart models on SageMaker HyperPod for fast, scalable inference

After developing generative AI models on SageMaker HyperPod, many customers import these models to Amazon Bedrock, a fully managed service for building and scaling generative AI applications. However, some customers want to use their SageMaker HyperPod compute resources to speed up their evaluation and move models into production faster.

Now, you can deploy open-weights models from Amazon SageMaker JumpStart, as well as fine-tuned custom models, on SageMaker HyperPod within minutes with no manual infrastructure setup. Data scientists can run inference on SageMaker JumpStart models with a single click, simplifying and accelerating model evaluation. This straightforward, one-time provisioning reduces manual infrastructure setup, providing a reliable and scalable inference environment with minimal effort. Large model downloads are reduced from hours to minutes, accelerating model deployments and shortening the time to market.

–

H.AI exists to push the boundaries of superintelligence with agentic AI.

“With Amazon SageMaker HyperPod, we used the same high-performance compute to build and deploy the foundation models behind our agentic AI platform. This seamless transition from training to inference streamlined our workflow, reduced time to production, and delivered consistent performance in live environments. SageMaker HyperPod helped us go from experimentation to real-world impact with greater speed and efficiency.”
–Laurent Sifre, Co-founder & CTO at H.AI

–

Seamlessly access the powerful compute resources of SageMaker AI from local development environments

Today, many customers choose from the broad set of fully managed integrated development environments (IDEs) available in SageMaker AI for model development, including JupyterLab, Code Editor based on Code-OSS, and RStudio. Although these IDEs enable secure and efficient setups, some developers prefer to use local IDEs on their personal computers for their debugging capabilities and extensive customization options. However, customers using a local IDE, such as Visual Studio Code, couldn’t easily run their model development tasks on SageMaker AI until now.

With new remote connections to SageMaker AI, developers and data scientists can quickly and seamlessly connect to SageMaker AI from their local VS Code, maintaining access to the custom tools and familiar workflows that help them work most efficiently. Developers can build and train AI models using their local IDE while SageMaker AI manages remote execution, so you can work in your preferred environment while still benefiting from the performance, scalability, and security of SageMaker AI. You can now choose your preferred IDE—whether that is a fully managed cloud IDE or VS Code—to accelerate AI model development using the powerful infrastructure and seamless scalability of SageMaker AI.

–

CyberArk is a leader in Identity Security, which provides a comprehensive approach centered on privileged controls to protect against advanced cyber threats.

“With remote connections to SageMaker AI, our data scientists have the flexibility to choose the IDE that makes them most productive. Our teams can leverage their customized local setup while accessing the infrastructure and security controls of SageMaker AI. As a security first company, this is extremely important to us as it ensures sensitive data stays protected, while allowing our teams to securely collaborate and boost productivity.”
–Nir Feldman, Senior Vice President of Engineering at CyberArk

–

Build generative AI models and applications faster with fully managed MLflow 3.0

As customers across industries accelerate their generative AI development, they require capabilities to track experiments, observe behavior, and evaluate performance of models and AI applications. Customers such as Cisco, SonRai, and Xometry are already using managed MLflow on SageMaker AI to efficiently manage ML model experiments at scale. The introduction of fully managed MLflow 3.0 on SageMaker AI makes it straightforward to track experiments, monitor training progress, and gain deeper insights into the behavior of models and AI applications using a single tool, helping you accelerate generative AI development.

Conclusion

In this post, we shared some of the new innovations in SageMaker AI to accelerate how you can build and train AI models.

To learn more about these new features, SageMaker AI, and how companies are using this service, refer to the following resources:

About the author

Ankur Mehrotra joined Amazon back in 2008 and is currently the General Manager of Amazon SageMaker AI. Before Amazon SageMaker AI, he worked on building Amazon.com’s advertising systems and automated pricing technology.

Source link

What's Hot

Google Photos upgrades its image-to-video feature with Veo 3

Free Mark Cuban Foundation AI Bootcamp Coming to Richmond This Fall

C3.AI Stock Is Tumbling Thursday: What’s Going On? – C3.ai (NYSE:AI)

New capabilities in Amazon SageMaker AI continue to transform how organizations develop AI models

Enhancing LLM accuracy with Coveo Passage Retrieval on Amazon Bedrock

Unlocking the future of professional services: How Proofpoint uses Amazon Q Business

Authenticate Amazon Q Business data accessors using a trusted token issuer

Kadist to Close San Francisco Art Space After 14 Years

Nazi-Looted Painting from Argentine Home May Have Been Recovered

Moche Residence Unearthed at Archaeological Site in Northern Peru

Kim Sajet to Helm the Milwaukee Art Museum