IT Brief Australia - Technology news for CIOs & IT decision-makers

Datadog launches GPU Monitoring to curb AI cloud costs

Thu, 23rd Apr 2026

Datadog has launched GPU Monitoring worldwide for customers seeking to control the cost and use of graphics processing units in AI workloads.

The product gives developers, machine learning engineers and platform teams a single view of GPU health, workload performance and spending across their environments. It is designed to help organisations identify idle or underused GPUs, track bottlenecks affecting model training and match workloads to specific GPU models.

The launch comes as businesses increase spending on AI infrastructure and face growing pressure to account for how these expensive computing resources are allocated. GPU instances account for 14 per cent of compute costs, Datadog said, underscoring how quickly cloud bills can rise as AI projects scale.

According to Datadog, many existing GPU management tools focus on basic device health data but do not show how infrastructure issues relate to failed training runs, inference slowdowns or unused capacity. That gap can lead teams to overprovision hardware as a precaution, raising costs and making planning harder.

Datadog said its monitoring product links telemetry from GPU fleets to the workloads using those resources. This allows engineering and machine learning teams to investigate the same problems through a shared operational view instead of relying on separate tools and datasets.

Yanbing Li, Chief Product Officer at Datadog, said the challenge has moved beyond technical operations into financial planning.

"GPU instances account for 14 percent of compute costs, which is a huge issue as companies are struggling to build AI-first technology in scalable and smart ways. While these companies can see their costs climbing, they can't chargeback GPU spend across business units, see workload context or identify clear next steps for improvement. As a result, it is very challenging to budget and plan in thoughtful ways," Li said.

Li added that without a consolidated view across infrastructure and workloads, companies struggle to make decisions on purchasing, allocation and troubleshooting.

"Smartly managing AI spend becomes a board-level conversation when capacity is misallocated, training and inference workloads stall, and costs escalate. We all know managing GPU costs is a huge problem we need to solve, but most companies are experimenting with solutions and it is still very difficult to get a single view of what is happening across the stack. GPU Monitoring fixes that with efficiency and reliability that we haven't seen before," Li said.

Customer use

Hyperbolic, a provider of AI cloud services, is among the product's early users. It uses the tool to monitor a multi-tenant GPU environment in which multiple customers share underlying infrastructure.

Such setups can make visibility more difficult because operators need to understand performance and utilisation at both the fleet and individual device level while keeping customer environments separate.

"Datadog GPU Monitoring has made it easy for us to stay on top of our multi-tenant GPU infrastructure. We get per-instance, per-device visibility into core utilization, memory, power and thermals right out of the box with no extra setup. The dashboards are rich out of the gate and simple to customize, and standing up isolated views per customer takes minutes," said Kai Huang, Head of Product at Hyperbolic.

Hyperbolic also uses Datadog's LLM observability tooling alongside the new product.

"Layering on LLM Observability ties it all together. We can go from a model latency spike straight to the underlying GPU metrics without switching tools. Full stack AI observability in one platform means both our team and our customers can move faster with confidence," Huang said.

Broader pressure

The launch reflects a broader shift in enterprise AI spending, as the cost of model training and inference becomes a central operating issue rather than a specialist engineering concern. GPUs remain scarce and expensive in many parts of the market, and companies often face long procurement cycles when they decide they need more capacity.

In that environment, tools that show whether existing hardware is being fully used can shape both budgeting and deployment decisions. They can also influence how costs are assigned across teams, particularly in large businesses where AI development spans several business units.

Datadog has been expanding its product set across infrastructure, applications, logs, security and AI observability. With GPU Monitoring, it is targeting a specific pain point in AI operations: connecting hardware utilisation and health data with the applications and models consuming those resources.

The product is designed to help teams decide whether to buy more GPUs or free capacity from existing fleets, shorten the time needed to diagnose slow workloads, and spot unhealthy devices before they disrupt training or inference jobs.

GPU Monitoring is now generally available.