This post was co-written with Datadog’s Curtis Maher and Anjali Thatte.
This post describes Datadog’s new integration with AWS Neuron, which provides deep visibility into resource utilization, model execution performance, latency, and real-time infrastructure health for AWS Trainium and AWS Inferentia instances, helping you optimize your machine learning (ML) workloads and deliver high performance at scale.
Neuron is an SDK used to run deep learning workloads on Trainium- and Inferentia-based instances. The AWS AI chips Trainium and Inferentia enable you to build and deploy generative AI models with higher performance and lower cost. As models grow larger and require a large number of accelerated compute instances, observability plays a critical role in ML operations: it helps you improve performance, diagnose and fix faults, and optimize resource utilization.
Datadog is an observability and security platform that provides real-time monitoring of cloud infrastructure and ML operations. Datadog is excited to launch our Neuron integration, which brings metrics collected by the Neuron SDK’s Neuron Monitor tool into Datadog, allowing you to track the performance of your Trainium- and Inferentia-based instances. Datadog provides real-time visibility into model performance and hardware usage so you can train and run inference efficiently, optimize resource utilization, and prevent service delays.
Comprehensive monitoring of Trainium and Inferentia
The Datadog and Neuron SDK integration automatically collects metrics and logs from your Trainium and Inferentia instances and sends them to the Datadog platform. Enabling the integration makes it easy for users to discover out-of-the-box dashboards in Datadog and start monitoring right away. You can also modify existing dashboards and monitors, or add new dashboards and monitors to suit your specific monitoring requirements.
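As a sketch of what enabling the integration can look like on a self-managed host, the Datadog Agent is typically pointed at the metrics endpoint exposed by the Neuron Monitor tooling through a small configuration file. The file path, option name, and port below are assumptions for illustration; consult the Datadog Neuron integration documentation for the exact settings for your environment.

```yaml
# conf.d/neuron.d/conf.yaml on the monitored Trainium/Inferentia instance
# (path, option names, and port are illustrative assumptions)
instances:
    # Endpoint where Neuron Monitor metrics are exported for the Agent to scrape
  - openmetrics_endpoint: http://localhost:8000
```

After updating the configuration, restarting the Agent would pick up the new check and the out-of-the-box dashboards begin populating.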
The Datadog dashboard provides a detailed view of the performance of your AWS AI chip (Trainium or Inferentia), including instance count, availability, and AWS Region. Real-time metrics give you an instant snapshot of your infrastructure health, and preconfigured monitors alert your team to critical issues such as latency, resource utilization, and execution errors. The following screenshot shows an example dashboard.
For example, if a particular instance experiences a spike in latency, a monitor in the monitor overview section of the dashboard will turn red and an alert will be triggered through Datadog or other paging mechanisms (such as Slack or email). High latency can indicate high user demand or an inefficient data pipeline, either of which can lead to slow response times. Identifying these signals early allows your team to respond quickly and maintain a high-quality user experience.
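A latency monitor of this kind boils down to a threshold query over a metric in Datadog’s standard monitor query syntax. The metric name and threshold below are illustrative assumptions, not the integration’s actual metric names:

```
avg(last_5m):avg:neuron.execution.latency{*} by {host} > 0.5
```

When the 5-minute average execution latency on any host crosses the threshold, the monitor changes state and notifies the configured channels.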
Datadog’s Neuron integration allows you to track key performance aspects, providing critical insights for troubleshooting and optimization.
- NeuronCore counters – Monitoring NeuronCore utilization helps ensure cores are being used efficiently and identifies whether adjustments are needed to balance workloads or optimize performance.
- Execution status – You can monitor the progress of your training jobs, including completed tasks and failed runs. This data helps you confirm that your models are training smoothly and reliably. An increase in failures can indicate issues with data quality, model configuration, or resource limitations that need to be addressed.
- Memory used – Provides a detailed view of memory usage across both the host and Neuron devices, including memory allocated for tensors and model execution. This helps you understand how effectively your resources are being used and when it’s time to rebalance your workload or scale your resources to avoid bottlenecks that interrupt training.
- Neuron runtime vCPU usage – You can monitor vCPU usage to ensure that your model is not overloading your infrastructure. When your vCPU utilization exceeds a certain threshold, you are alerted to decide whether to redistribute your workload or upgrade your instance type to avoid slowing down your training.
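Under the hood, these metrics originate from the Neuron Monitor tool, which emits periodic JSON reports. As a minimal sketch of how such a report can be summarized, the function below computes average NeuronCore utilization from one report. The JSON field names are assumptions based on our reading of the Neuron Monitor output format; verify them against your Neuron SDK version.

```python
import json

def average_core_utilization(report_line: str) -> float:
    """Return the mean NeuronCore utilization (percent) from one JSON report.

    Field names (neuron_runtime_data, neuroncore_counters, etc.) are
    illustrative assumptions; check your neuron-monitor output schema.
    """
    report = json.loads(report_line)
    utilizations = []
    for runtime in report.get("neuron_runtime_data", []):
        counters = runtime.get("report", {}).get("neuroncore_counters", {})
        for core in counters.get("neuroncores_in_use", {}).values():
            utilizations.append(core["neuroncore_utilization"])
    return sum(utilizations) / len(utilizations) if utilizations else 0.0

# Hand-written sample report with illustrative values:
sample = json.dumps({
    "neuron_runtime_data": [{
        "report": {"neuroncore_counters": {"neuroncores_in_use": {
            "0": {"neuroncore_utilization": 40.0},
            "1": {"neuroncore_utilization": 60.0},
        }}}
    }]
})
print(average_core_utilization(sample))  # -> 50.0
```

The Datadog integration performs this collection and aggregation for you; the sketch is only meant to show the shape of the data behind the dashboard tiles.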
By consolidating these metrics into a single view, Datadog gives teams powerful tools for maintaining efficient, high-performing Neuron workloads, allowing them to identify issues in real time and optimize their infrastructure as needed. You can also pair the Neuron integration with Datadog’s LLM observability capabilities to gain comprehensive visibility into your large language model (LLM) applications.
Get started with Datadog, Inferentia, and Trainium
The Datadog Neuron integration provides real-time visibility into Trainium and Inferentia, helping you optimize resource utilization, troubleshoot issues, and achieve seamless performance at scale. To get started, see AWS Inferentia and AWS Trainium Monitoring.
For more information about how Datadog integrates with Amazon ML Services and Datadog LLM observability, see Monitoring Amazon Bedrock with Datadog and Monitoring Amazon SageMaker with Datadog.
If you don’t already have a Datadog account, you can sign up for a 14-day free trial today.
About the authors
Curtis Maher is a Product Marketing Manager at Datadog, focusing on the platform’s cloud and AI/ML integrations. He works closely with Datadog’s product, marketing, and sales teams to coordinate product launches and help customers monitor and secure their cloud infrastructure.
Anjali Thatte is a Product Manager at Datadog. She is currently focused on building technology to monitor AI infrastructure and ML tools and on helping customers gain visibility across their AI application technology stack.
Jason Mimic is a Senior Partner Solutions Architect at AWS, working closely with product, engineering, marketing, and sales teams on a daily basis.
Anuj Sharma is a Principal Solutions Architect at Amazon Web Services, specializing in application modernization using technologies such as serverless, containers, generative AI, and observability. With over 18 years of experience in application development, he currently leads co-builds with AWS software partners focused on containers and observability.