DGX Cloud Benchmarking on Azure
This blog presents our benchmarking results for NVIDIA DGX Cloud workloads on Azure, scaling from 8 to 1,024 H100 GPUs. We detail the Slurm-based setup using Azure CycleCloud Workspace for Slurm, performance validation via NCCL and thermal screening, and tuning strategies that deliver near-parity with NVIDIA DGX reference metrics.

Computer-Aided Engineering “CAE” on Azure
Table of Contents:
1. What is Computer-Aided Engineering (CAE)?
2. Why Move CAE to Cloud? Cloud vs. On-Premises
3. What Makes Azure Special for CAE Workloads?
4. What Makes Azure Stand out Among Public Cloud Providers? “InfiniBand Interconnect”
5. Key CAE Workloads on Azure
6. Azure HPC VM Series for CAE Workloads
7. CAE Software Partnership “ISVs”
8. Robust Ecosystem of System Integrator “SI” Partners
9. Real-World Use Case: Automotive Sector
10. The Future of CAE is Cloud-Native
11. Final Thoughts
--------------------------------------------------------------------------------------------------------
1. What is Computer-Aided Engineering “CAE”?
Computer-Aided Engineering (CAE) is a broad term that refers to the use of computer software to aid in engineering tasks, including simulation, validation, and optimization of products, processes, and manufacturing tools. CAE is integral to modern engineering, allowing engineers to explore ideas, validate concepts, and optimize designs before building physical prototypes. CAE encompasses fields such as finite element analysis (FEA), computational fluid dynamics (CFD), and multibody dynamics (MBD).
CAE tools are widely used in industries like automotive, aerospace, and manufacturing to improve product design and performance. For example, in the automotive industry, CAE tools help reduce product development costs and time while enhancing the safety, comfort, and durability of vehicles. CAE tools are often used to analyze and optimize designs created within CAD (Computer-Aided Design) software.
CAE systems typically involve three phases:
Pre-processing: Defining the model and the environmental factors to be applied to it.
Analysis solver: Performing the analysis, usually on high-powered computers.
Post-processing: Visualizing the results.
In a world where product innovation moves faster than ever, CAE has become a cornerstone of modern design and manufacturing. From simulating airflow over an F1 car to predicting stress in an aircraft fuselage, CAE allows engineers to explore ideas, validate concepts, and optimize designs before a single prototype is built.
--------------------------------------------------------------------------------------------------------
2. Why Move CAE to Cloud? Cloud vs. On-Premises
Historically, CAE workloads were run on-premises due to their compute-intensive nature and large data requirements. Traditional CAE methods, dependent on expensive on-premises HPC clusters, are facing a tipping point, and many organizations are now embracing cloud-based CAE. When considering whether to use cloud or on-premises solutions, there are several factors to weigh:
Cost and Maintenance: On-premises solutions require a large upfront investment in hardware and ongoing costs for maintenance and upgrades. Cloud solutions, on the other hand, spread costs over time and often result in a lower total cost of ownership.
Security and Privacy: On-premises solutions offer control over security but require significant resources to manage. Cloud providers offer advanced security features and compliance certifications, often surpassing what individual companies can achieve on their own.
Scalability and Flexibility: Cloud solutions provide unmatched scalability and flexibility, allowing businesses to quickly adjust resources based on demand. On-premises solutions can be more rigid and require additional investments to scale.
Reliability and Availability: Cloud providers offer high availability and disaster recovery options, often with service level agreements (SLAs) guaranteeing uptime. On-premises solutions depend on the company's infrastructure and may require additional investments for redundancy and disaster recovery.
Integration and Innovation: Cloud solutions often integrate seamlessly with other cloud services and offer continuous innovation through regular updates and new features. They also let you run more simulations in parallel, reducing time-to-solution, accelerating the product development cycle, and shortening time to market. On-premises solutions may lag in terms of innovation and require manual integration efforts.
Global Access: Teams can collaborate and access data and models from anywhere. Cloud gives you global, on-demand supercomputing access without the physical, financial, and operational burden of traditional on-premises clusters.
In summary, the choice between cloud and on-premises solutions depends on various factors including cost, performance, security, maintenance, flexibility, and specific business needs. Cloud provides customers with global scalability, high availability, and a broad range of capabilities within a secure, integrated platform. It enables organizations to concentrate on core product innovation, accelerating their journey to market.
The following comparison shows Azure vs. on-premises for CAE workloads:
Global Reach
- Cloud (Azure): 60+ regions worldwide — deploy compute close to users, customers, or engineers.
- On-Premises: Limited to where physical hardware is located (one or a few sites).
Access Flexibility
- Cloud (Azure): Access from anywhere with secure authentication (VPN/SSO/Conditional Access).
- On-Premises: Access generally restricted to the internal corporate network or VPN.
Collaboration
- Cloud (Azure): Teams across continents can work on shared HPC clusters easily.
- On-Premises: Remote collaboration can be slow and complex; security risks are higher.
Elastic Scaling
- Cloud (Azure): Instantly scale resources up or down globally based on demand. Start small, grow big — then shrink when needed.
- On-Premises: Scaling requires buying, installing, and maintaining new hardware.
Time to Deploy
- Cloud (Azure): No wait for procurement; minutes to spin up a new HPC cluster in a new region.
- On-Premises: Weeks or months to procure, rack, and configure hardware in a new location.
Disaster Recovery
- Cloud (Azure): Built-in regional redundancy, backup options, and replication across regions.
- On-Premises: Disaster recovery requires manual setup and physical duplication.
Compliance & Data Residency
- Cloud (Azure): Choose specific Azure regions to meet compliance requirements (GDPR, HIPAA, ITAR, etc.).
- On-Premises: Need to build compliant infrastructure manually.
Network Latency
- Cloud (Azure): Optimize by deploying close to users; fast backbone network across regions.
- On-Premises: Bound by physical proximity; long-distance remote work suffers latency.
Maintenance
- Cloud (Azure): Azure handles hardware upgrades, security patches, and downtime minimization.
- On-Premises: In-house IT teams are responsible for all hardware, software, and patching.
Security at Scale
- Cloud (Azure): Microsoft has committed to invest $20B in cybersecurity over five years and invests more than $1B annually; ISO, SOC, and GDPR certified globally.
- On-Premises: Requires dedicated resources to manage security protocols and maintain visibility across all systems, which can be more complex and resource-intensive than cloud solutions.
Cost Optimization
- Cloud (Azure): Operates on a pay-as-you-go model, enabling businesses to scale usage and costs as needed and avoiding the capital expenditure of purchasing hardware. Azure also offers various pricing options and discounts, such as reserved capacity, spot pricing, and Azure Hybrid Benefit, which can significantly reduce costs — massive cost-control flexibility.
- On-Premises: Requires significant upfront capital investment in hardware, software licenses, and infrastructure setup. These costs include purchasing and maintaining physical servers, which are subject to technological obsolescence. Ongoing expenses include system maintenance, support, power consumption, and cooling.
Innovation
- Cloud (Azure): Access to the latest GPUs and CPUs (H100, H200, GB200, AMD MI300X, HBv3, HBv4, HBv5).
- On-Premises: Needs investments in hardware refresh cycles.
Managed Storage
- Cloud (Azure): Agility with instant provisioning; virtually unlimited scalability with automatic scale up or down; fully managed, including updates, patches, and backup; high availability and DR through redundancy, geo-replication, and automated DR options; enterprise-grade security with encryption at rest and in transit and compliance certifications; pay-as-you-go or reserved pricing with no upfront hardware cost (CapEx); global access through the internet; continuous improvements with AI-driven optimization.
- On-Premises: Offers control but demands heavy investment in hardware and time-consuming deployment. Scaling is limited by physical hardware capacity. Must be managed by in-house IT teams, requiring significant time, expertise, and resources. Redundancy and DR must be designed, funded, and maintained manually. Security depends on in-house capabilities and requires investment. High upfront capital expenditure (CapEx). Access is limited to local networks unless extended with complex remote-access solutions. Innovation depends on hardware refresh cycles, which are limited by expense and infrequency.
Software Images & Marketplace
- Cloud (Azure): Instant access to thousands of pre-built software images via the Marketplace; complete environments deploy in minutes from ready-to-use templates. A huge ecosystem of Microsoft, open-source, and third-party vendor solutions is constantly updated. Marketplace software often comes with built-in update capabilities, auto-patching, and cloud-optimized versions. Cost flexibility through pay-as-you-go (PAYG) licensing, bring-your-own-license (BYOL) options, or subscription models. Innovation through early access to beta, cloud-native, and AI-enhanced software from top vendors. Security is safeguarded, as Marketplace images are verified against cloud provider security and compliance standards.
- On-Premises: Software must be sourced, manually installed, and configured, so deployment, installation, environment setup, and configuration can take days or weeks. Limited by licensing agreements, internal vendor contracts, and physical hardware compatibility. Manual updates are required, and IT must monitor, download, test, and apply patches individually. Large upfront license purchases are often needed, and renewal and true-up costs can be complex and expensive. Innovation is limited, as new software adoption is delayed by procurement, budgeting, and testing cycles. Security assurance depends on internal vetting processes and manual hardening.
--------------------------------------------------------------------------------------------------------
3. What Makes Azure Special for CAE Workloads?
Microsoft Azure is a cloud platform enabling scalable, secure, and high-performance CAE workflows across industries. Our goal in Azure is to provide the CAE field with a one-stop, best-in-class technology platform, rich with solution offerings and supported by a robust ecosystem of partners.
Azure offers several unique features and benefits that make it particularly well suited for Computer-Aided Engineering (CAE) workloads:
GPU Acceleration: Azure provides powerful GPU options, such as NVIDIA GPUs, which significantly enhance the performance of leading CAE tools. This results in improved turnaround times, reduced power consumption, and lower hardware costs. For example, tools like Ansys Speos for lighting simulation and CPFD's Barracuda Virtual Reactor have been optimized to take advantage of these GPUs.
High-Performance Computing (HPC): Azure offers specialized HPC solutions, such as the HBv3 and HBv4/HX series, which are designed for high-performance workloads. These solutions provide the computational power needed for complex simulations and analyses.
Scalability and Flexibility: Azure's cloud infrastructure allows for easy scaling of resources to meet the demands of CAE workloads. This flexibility ensures that you can handle varying levels of computational intensity without significant upfront investment in hardware.
Integration with Industry Tools: Azure supports a wide range of CAE software and tools, making it easier to integrate existing workflows into the cloud environment. This includes certification and optimization of CAE tools on Azure.
Support for Hybrid Environments: Azure provides solutions for hybrid cloud environments, allowing you to seamlessly integrate on-premises resources with cloud resources. This is particularly useful for organizations transitioning to the cloud or requiring a hybrid setup for specific workloads.
Global Reach: As of April 2025, Microsoft Azure operates over 60 announced regions and more than 300 data centers worldwide, making it the most expansive cloud infrastructure among major providers. Azure ensures low latency and high availability for CAE workloads, regardless of where your team is located.
These features collectively make Azure a powerful and flexible platform for running CAE workloads, providing the computational power, scalability, and security needed to handle complex engineering simulations and analyses.
--------------------------------------------------------------------------------------------------------
4. What Makes Azure Stand out Among Public Cloud Providers? “InfiniBand Interconnect”
The InfiniBand interconnect is one of the key differentiators that makes Microsoft Azure stand out among public cloud providers, especially for high-performance computing (HPC) and CAE workloads. Here is what makes InfiniBand a game changer on Azure:
a) Ultra-Low Latency & High Memory Bandwidth
InfiniBand on Azure delivers 200 Gbps interconnect speeds (up to 400 Gbps with HDR/NDR in some cases, and 800 Gbps for the latest SKU, HBv5, currently in preview). This ultra-low-latency, high-throughput network is ideal for tightly coupled parallel workloads such as CFD, FEA, weather simulations, and molecular modeling. When the newly added AMD SKU, HBv5, transitions from preview to general availability (GA), memory bandwidth will no longer be a limitation for workloads such as CFD and weather simulations. The HBv5 offers an impressive 7 TB/s of memory bandwidth, which is 8 times greater than the latest bare-metal and cloud alternatives. It also provides nearly 20 times more bandwidth than Azure HBv3 and Azure HBv2, which use the 3rd Gen EPYC™ with 3D V-Cache (“Milan-X”) and the 2nd Gen EPYC™ (“Rome”), respectively.
Additionally, the HBv5 delivers up to 35 times more memory bandwidth compared to a 4–5-year-old HPC server nearing the end of its hardware lifecycle.
b) RDMA (Remote Direct Memory Access) Support
RDMA enables direct memory access between VMs, bypassing the CPU, which drastically reduces latency and increases application efficiency — a must for HPC workloads.
c) True HPC Fabric in the Cloud
Azure is the only major public cloud provider that offers InfiniBand across multiple VM families:
HBv3/HBv4 (for CFD, FEA, Multiphysics, Molecular Dynamics)
HX-series (Structural Analysis)
ND-series (GPU + MPI)
This allows scaling MPI workloads across thousands of cores — something typically limited to on-premises supercomputers.
d) Production-Grade Performance for CAE
Solvers like ANSYS Fluent, STAR-CCM+, Abaqus, and MSC Nastran have benchmarked extremely well on Azure, thanks in large part to the InfiniBand-enabled infrastructure. If you are building CAE, HPC, or AI workloads that rely on ultra-fast communication between nodes, Azure's InfiniBand-powered VM SKUs offer the best cloud-native alternative to on-premises HPC clusters.
--------------------------------------------------------------------------------------------------------
5. Key CAE Workloads on Azure
CAE is not a one-size-fits-all domain. Azure supports a broad spectrum of CAE applications, such as:
Computational Fluid Dynamics (CFD): ANSYS Fluent, Ansys CFX, Siemens Simcenter STAR-CCM+, Convergent Science CONVERGE CFD, Autodesk CFD, OpenFOAM, NUMECA Fine/Open, Altair AcuSolve, Simerics MP+, Cadence Fidelity CFD, COMSOL Multiphysics (CFD Module), Dassault Systèmes XFlow, etc.
Finite Element Analysis (FEA): ANSYS Mechanical, Dassault Systèmes Abaqus, Altair OptiStruct, Siemens Simcenter 3D, MSC Nastran, Autodesk Fusion 360 Simulation, COMSOL Multiphysics (Structural Module), etc.
Thermal & Electromagnetic Simulation: COMSOL Multiphysics, Ansys HFSS, CST Studio Suite, Ansys Mechanical (Thermal Module), Siemens Simcenter 3D Thermal, Dassault Systèmes Abaqus Thermal, etc.
Crash & Impact Testing: Ansys LS-DYNA, Altair Radioss, ESI PAM-Crash, Siemens Simcenter Madymo, Dassault Systèmes Abaqus (Explicit), Ansys Autodyn, etc.
These applications require a combination of powerful CPUs, a large memory footprint, high memory bandwidth, and low-latency interconnects; some also offer GPU-accelerated versions. All of these are available in Azure's purpose-built HPC VM families.
--------------------------------------------------------------------------------------------------------
6. Azure HPC VM Series for CAE Workloads
Azure offers specialized VM series tailored for CAE applications. These VMs support RDMA-enabled InfiniBand networking, critical for scaling CAE workloads across nodes in parallel simulations.
CPU:
HBv3, HBv4 Series: Ideal for memory-intensive workloads like CFD and FEA, offering high memory bandwidth and low-latency interconnects.
HX Series: Optimized for structural analysis applications, providing significant performance boosts for solvers like MSC Nastran and others.
GPU:
ND Series: GPU-accelerated VMs optimized for CAE workloads, offering high double-precision compute, large memory bandwidth, and scalable performance with NVIDIA H100, H200, and GB200 and AMD MI300X GPUs.
The highest-performing compute-optimized CPU offering in Azure today is the HBv4/HX series, featuring 176 cores of 4th Gen AMD EPYC processors with 3D V-Cache technology (“Genoa-X”).
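To check which of these HPC and GPU sizes are offered in a particular Azure region before deploying, a quick Azure CLI query along the following lines can help (illustrative only; the region and the filter pattern are placeholders):
az vm list-sizes --location eastus --output table | grep -Ei 'Standard_(HB|HX|ND)'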
Below is a sample performance comparison of four different AMD SKU generations against the Intel “HCv1-Skylake” SKU, using the Ansys Fluent F1 Racecar (140M cells) model. The full performance and scalability results for HBv4 and HX-series VMs with Genoa-X CPUs are available HERE.
--------------------------------------------------------------------------------------------------------
7. CAE Software Partnership “ISVs”
Independent Software Vendors (ISVs) play a critical role on Azure by bringing trusted, industry-leading applications to the platform. Their solutions — spanning CAE, CFD, FEA, data analytics, AI, and more — are optimized to run efficiently on Azure's scalable infrastructure. ISVs ensure that customers can seamlessly move their workloads to the cloud without sacrificing performance, compatibility, or technical support. They also drive innovation by collaborating with Azure engineering teams to deliver cloud-native, HPC-ready, and AI-enhanced capabilities, helping businesses accelerate product development, simulations, and decision-making. Below is a partial list of these ISVs and their offerings on Azure:
ANSYS Access: SaaS platform built on Azure, offering native cloud experiences for Fluent, Mechanical, LS-DYNA, HFSS, etc.
Altair One: SaaS platform on Azure supporting Altair solvers such as HyperWorks, OptiStruct, Radioss, AcuSolve, etc.
Siemens Simcenter: Validated on Azure for fluid, structural, and thermal simulation with solvers such as STAR-CCM+, NX, and Femap.
Dassault Systèmes: Solvers such as Abaqus, CATIA, SIMULIA, and XFlow.
COMSOL: For its flagship solver, COMSOL Multiphysics.
CPFD Software: CPFD Software has optimized its simulation tool Barracuda Virtual Reactor for Azure, enabling engineers to perform particle-fluid simulations efficiently.
--------------------------------------------------------------------------------------------------------
8. Robust Ecosystem of System Integrator “SI” Partners
Azure CAE System Integrators (SIs) are specialized partners that assist organizations in deploying and managing CAE workloads on Microsoft Azure. These SIs provide expertise in cloud migration, HPC optimization, and integration of CAE applications, enabling businesses to leverage Azure's scalable infrastructure for engineering simulations and analyses.
a) What Do Azure CAE System Integrators Offer?
Azure CAE SIs deliver a range of services tailored to the unique demands of engineering and simulation workloads:
Cloud Migration: Transitioning on-premises CAE applications and data to Azure's cloud environment.
HPC Optimization: Configuring Azure's HPC resources to maximize performance for CAE tasks.
Application Integration: Ensuring compatibility and optimal performance of CAE software (e.g., ANSYS, Siemens, Altair, Abaqus) on Azure.
Managed Services: Ongoing support, monitoring, and maintenance of CAE environments on Azure.
b) Leading Azure CAE System Integrators
Several SIs have been recognized for their capabilities in deploying CAE solutions on Azure. A partial list: Rescale, TotalCAE, Oakwood Systems, UberCloud (SIMR), Capgemini, Accenture, Hexagon Manufacturing Intelligence.
c) Benefits of Collaborating with Azure CAE SIs
By partnering with Azure CAE System Integrators, organizations can effectively harness the power of cloud computing to enhance their engineering and simulation capabilities. Engaging with Azure CAE SIs can provide:
Expertise: Access to professionals experienced in both CAE applications and Azure infrastructure.
Efficiency: Accelerated deployment and optimization of CAE workloads.
Scalability: Ability to scale resources up or down based on project requirements.
Cost Management: Optimized resource usage leading to potential cost savings.
--------------------------------------------------------------------------------------------------------
9. Real-World Use Case: Automotive Sector
Rimac used Azure cloud computing to help with the design, testing, and manufacturing of its next-generation components and sportscars, and it is gaining even greater scale and speed in its product development processes with a boost from Microsoft Azure HPC.
Rimac's Azure HPC environment uses Azure CycleCloud to organize and orchestrate clusters, putting together different cluster types and sizes flexibly and as necessary. The solution includes Azure Virtual Machines, running containers on Azure HBv3 virtual machines with 3rd Gen AMD EPYC™ Milan processors with AMD 3D V-Cache, which are much faster than previous-generation Azure virtual machines for explicit calculations. Rimac's solution takes full advantage of the power of AMD, which offers the highest-performing x86 CPU for technical computing.
“We've gained a significant increase in computational speed with AMD, which leads to lower utilization of HPC licenses and faster iterations,” says Ivan Krajinović, Head of Simulations at Rimac Technology. “However complex the model we need to create, we know that we can manage it with Azure HPC. We now produce more highly complex models that simply wouldn't have been possible on our old infrastructure.”
--------------------------------------------------------------------------------------------------------
10. The Future of CAE is Cloud-Native
The next frontier in CAE is not just lifting and shifting legacy solvers into the cloud, but enabling cloud-native simulation pipelines, including:
AI-assisted simulation tuning
Serverless pre/post-processing workflows
Digital twins integrated with IoT data on Azure
Cloud-based visualization with NVIDIA Omniverse
With advances in GPU acceleration, parallel file systems (like Azure Managed Lustre File System, AMLFS), and intelligent job schedulers, Azure is enabling this next-gen CAE transformation today.
--------------------------------------------------------------------------------------------------------
11. Final Thoughts
Moving CAE to Azure is more than a tech upgrade; it is a shift in mindset. It empowers engineering teams to simulate more, iterate faster, and design better, without being held back by hardware constraints. If you are still running CAE workloads on aging, capacity-constrained systems, now is the time to explore what Azure HPC can offer. Let the cloud be your wind tunnel, your test track, your proving ground.
--------------------------------------------------------------------------------------------------------
Let's Connect
Have questions or want to share how you're using CAE in the cloud? Let's start a conversation! We'd love to hear your thoughts! Leave a comment below and join the conversation. 👇
#CAE #HPC #AzureHPC #EngineeringSimulation #CFD #FEA #CloudComputing #DigitalEngineering #MicrosoftAzure

Monitoring HPC & AI Workloads on Azure H/N VMs Using Telegraf and Azure Monitor (GPU & InfiniBand)
As HPC & AI workloads continue to scale in complexity and performance demands, ensuring visibility into the underlying infrastructure becomes critical. This guide presents an essential monitoring solution for AI infrastructure deployed on Azure RDMA-enabled virtual machines (VMs), focusing on NVIDIA GPUs and Mellanox InfiniBand devices. By leveraging the Telegraf agent and Azure Monitor, this setup enables real-time collection and visualization of key hardware metrics, including GPU utilization, GPU memory usage, InfiniBand port errors, and link flaps. It provides operational insights vital for debugging, performance tuning, and capacity planning in high-performance AI environments.
In this blog, we'll walk through the process of configuring Telegraf to collect and send GPU and InfiniBand monitoring metrics to Azure Monitor. This end-to-end guide covers all the essential steps to enable robust monitoring for NVIDIA GPUs and Mellanox InfiniBand devices, empowering you to track, analyze, and optimize performance across your HPC & AI infrastructure on Azure.
DISCLAIMER: This is an unofficial configuration guide and is not supported by Microsoft. Please use it at your own discretion. The setup is provided "as-is" without any warranties, guarantees, or official support.
While Azure Monitor offers robust monitoring capabilities for CPU, memory, storage, and networking, it does not natively support GPU or InfiniBand metrics for Azure H- or N-series VMs. To monitor GPU and InfiniBand performance, additional configuration using third-party tools such as Telegraf is required. As of the time of writing, Azure Monitor does not include built-in support for these metrics without external integrations.
Step 1: Prepare Azure to receive GPU and InfiniBand metrics sent by Telegraf agents from a VM or VMSS
Register the microsoft.insights resource provider in your Azure subscription. Refer: Resource providers and resource types - Azure Resource Manager | Microsoft Learn
Step 2: Enable a Managed Identity to authenticate the Azure VM or VMSS
In this example we use a Managed Identity for authentication. You can also use a user-assigned managed identity or a service principal to authenticate the VM. Refer: telegraf/plugins/outputs/azure_monitor at release-1.15 · influxdata/telegraf (github.com)
Step 3: Set Up the Telegraf Agent Inside the VM or VMSS to Send Data to Azure Monitor
In this example, I'll use an Azure Standard_ND96asr_v4 VM with the Ubuntu-HPC 2204 image to configure the environment for VMSS. The Ubuntu-HPC 2204 image comes with pre-installed NVIDIA GPU drivers, CUDA, and InfiniBand drivers. If you opt for a different image, ensure that you manually install the necessary GPU drivers, CUDA toolkit, and InfiniBand drivers.
Next, download and run the gpu-ib-mon_setup.sh script to install the Telegraf agent on Ubuntu 22.04. This script also configures the NVIDIA SMI and InfiniBand input plugins, along with the Telegraf output configuration that sends data to Azure Monitor.
Note: The gpu-ib-mon_setup.sh script is currently supported and tested only on Ubuntu 22.04.
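For orientation, the resulting configuration is expected to resemble the minimal sketch below; the plugin names follow the Telegraf documentation, while the exact file written by gpu-ib-mon_setup.sh may differ in options and intervals:
sudo tee /etc/telegraf/telegraf.conf > /dev/null <<'EOF'
[agent]
  interval = "30s"          # how often metrics are collected
  flush_interval = "60s"    # Azure Monitor ingests custom metrics at 1-minute resolution

[[inputs.nvidia_smi]]
  bin_path = "/usr/bin/nvidia-smi"   # adjust if the driver installs the binary elsewhere

[[inputs.infiniband]]
  # reads per-port counters from /sys/class/infiniband

[[outputs.azure_monitor]]
  namespace_prefix = "Telegraf/"     # metrics surface under Telegraf/<input plugin> namespaces
  # region and resource_id are auto-detected from the VM's instance metadata when omitted
EOF
sudo systemctl restart telegraf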
Please also read about the InfiniBand counters collected by Telegraf: https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters
Run the following commands:
wget https://raw.githubusercontent.com/vinil-v/gpu-ib-monitoring/refs/heads/main/scripts/gpu-ib-mon_setup.sh -O gpu-ib-mon_setup.sh
chmod +x gpu-ib-mon_setup.sh
./gpu-ib-mon_setup.sh
Test the Telegraf configuration by executing the following command:
sudo telegraf --config /etc/telegraf/telegraf.conf --test
Step 4: Creating Dashboards in Azure Monitor to Check NVIDIA GPU and InfiniBand Usage
Telegraf includes an output plugin specifically designed for Azure Monitor, allowing custom metrics to be sent directly to the platform. Since Azure Monitor supports a metric resolution of one minute, the Telegraf output plugin aggregates metrics into one-minute intervals and sends them to Azure Monitor at each flush cycle. Metrics from each Telegraf input plugin are stored in a separate Azure Monitor namespace, typically prefixed with Telegraf/ for easy identification.
To visualize NVIDIA GPU usage, go to the Metrics section in the Azure portal:
Set the scope to your VM.
Choose the metric namespace Telegraf/nvidia-smi.
From there, you can select and display various GPU metrics such as utilization, memory usage, temperature, and more. In this example we use the GPU memory_used metric. Use filters and splits to analyze data across multiple GPUs or over time.
To monitor InfiniBand performance, repeat the same process:
In the Metrics section, set the scope to your VM.
Select the metric namespace Telegraf/infiniband.
You can visualize metrics such as port status, data transmitted/received, and error counters. In this example, we use the link-flap metric to check InfiniBand link flaps. Use filters to break down the data by port or metric type for deeper insights.
link_downed metric. Note: the link_downed metric returns incorrect values with the Count aggregation; use the Max or Min aggregation instead.
port_rcv_data metrics
Creating custom dashboards in Azure Monitor with both the Telegraf/nvidia-smi and Telegraf/infiniband namespaces allows for unified visibility into GPU and InfiniBand health.
Testing InfiniBand and GPU Usage
If you're testing GPU metrics and need a reliable way to simulate multi-GPU workloads, especially over InfiniBand, here is a straightforward solution using the NCCL benchmark suite. This method is ideal for verifying GPU and network monitoring setups. The NCCL benchmarks and OpenMPI are part of the Ubuntu-HPC 22.04 image. Update the variables according to your environment, and update the hostfile with your hostnames.
module load mpi/hpcx-v2.13.1
export CUDA_VISIBLE_DEVICES=2,3,0,1,6,7,4,5
mpirun -np 16 --map-by ppr:8:node -hostfile hostfile \
 -mca coll_hcoll_enable 0 --bind-to numa \
 -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
 -x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \
 -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
 -x NCCL_SOCKET_IFNAME=eth0 \
 -x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
 -x NCCL_DEBUG=WARN \
 /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -c 1
Alternate: GPU Load Simulation Using TensorFlow
If you're looking for a more application-like load (e.g., distributed training), I've prepared a script that sets up a multi-GPU TensorFlow training environment using Anaconda. This is a great way to simulate real-world GPU workloads and validate your monitoring pipelines.
To get started, run the following:
wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpu_test_program.sh -O gpu_test_program.sh
chmod +x gpu_test_program.sh
./gpu_test_program.sh
With either method (NCCL benchmarks or TensorFlow training), you'll be able to simulate realistic GPU usage and validate your GPU and InfiniBand monitoring setup with confidence. Happy testing!
References:
Ubuntu HPC on Azure
ND A100 v4-series GPU VM Sizes
Telegraf Azure Monitor Output Plugin (v1.15)
Telegraf NVIDIA SMI Input Plugin (v1.15)
Telegraf InfiniBand Input Plugin Documentation

Deploy NDm_v4 (A100) Kubernetes Cluster
We show how to deploy an optimal NDm_v4 (A100) AKS cluster, making sure that all 8 GPUs and 8 InfiniBand devices available on each virtual machine come up correctly and are available to deliver optimal performance. A multi-node NCCL all-reduce job is executed on the NDm_v4 AKS cluster to verify that it is deployed and configured correctly.

Running Container Workloads in CycleCloud-Slurm – Multi-Node, Multi-GPU Jobs (NCCL Benchmark)
Running high-performance computing (HPC) and AI workloads in the cloud requires a flexible and scalable orchestration platform. Microsoft Azure CycleCloud, when combined with Slurm, provides an efficient solution for managing containerized applications across HPC clusters. In this blog, we will explore how to run multi-node, multi-GPU workloads in a CycleCloud-Slurm environment using containerized workflows. We'll cover key configurations, job submission strategies, and best practices to maximize GPU utilization across multiple nodes.
To simplify this process, we developed cyclecloud-slurm-container, a custom project that automates the setup of Pyxis and Enroot for running containerized workloads in Slurm. This tool streamlines the installation and configuration of the required software, making it easier to deploy and manage containerized HPC applications. In this approach, we integrate these scripts to enable container support within a CycleCloud implementation.
We also offer a product called Azure CycleCloud Workspace for Slurm, an Azure Marketplace solution template that simplifies the creation, configuration, and deployment of pre-defined Slurm clusters on Azure using CycleCloud. It eliminates the need for prior knowledge of Azure or Slurm, and its Slurm clusters come pre-configured with PMIx v4, Pyxis, and Enroot, enabling seamless execution of containerized AI and HPC workloads.
As an example, we will demonstrate how to use the Azure Node Health Check (aznhc) Docker container to run NCCL benchmarks across multiple nodes and GPUs, showcasing the benefits of a well-optimized containerized HPC environment. Note that aznhc is not the preferred method for running NCCL all-reduce because it does not include the latest NCCL libraries; for optimal performance, use the most recent NCCL libraries.
Containers bring significant benefits to HPC and AI workloads by enhancing flexibility and efficiency. When integrated with CycleCloud and Slurm, they provide:
Portability – Package applications with all dependencies, ensuring consistent execution across different environments.
Isolation – Run applications in separate environments to prevent conflicts and maintain system integrity.
Reproducibility – Guarantee identical execution of workloads across multiple job submissions for reliable experimentation.
Scalability – Dynamically scale containerized workloads across multiple nodes and GPUs, optimizing resource utilization.
By leveraging containers within CycleCloud-Slurm, users can streamline workload management, simplify software deployment, and maximize the efficiency of their HPC clusters.
Testing Environment for Running Multi-Node, Multi-GPU NCCL Workloads in CycleCloud-Slurm
Before executing NCCL benchmarks across multiple nodes and GPUs using containers in a CycleCloud-Slurm setup, ensure the following prerequisites are met:
CycleCloud 8.x: A properly configured and running CycleCloud deployment.
Virtual Machines: Standard_ND96asr_v4 VMs, which include NVIDIA GPUs and an InfiniBand network optimized for deep learning training and high-performance AI workloads.
Slurm Configuration: Slurm version 24.05.4-2 (cyclecloud-slurm 3.0.11).
Operating System: Ubuntu 22.04 (microsoft-dsvm:ubuntu-hpc:2204:latest), pre-configured with essential GPU and InfiniBand drivers and HPC tools.
Container Runtime Setup: Deploy the cyclecloud-slurm-container project, which automates the configuration of Enroot and Pyxis for efficient container execution.
Azure Node Health Check (aznhc) Container: Used to validate node health and execute NCCL benchmarks for performance testing. As noted above, aznhc is not the preferred method for running NCCL all-reduce because it does not include the latest NCCL libraries; for optimal performance, use the most recent NCCL libraries.
This setup ensures a reliable and scalable environment for evaluating NCCL performance in a containerized CycleCloud-Slurm cluster.
Configuring the CycleCloud-Slurm Container Project
Follow these steps to set up the cyclecloud-slurm-container project, which automates the configuration of Pyxis and Enroot for running containerized workloads in Slurm.
Step 1: Open a Terminal Session on the CycleCloud Server
Ensure you have access to the CycleCloud server with the CycleCloud CLI enabled.
Step 2: Clone the Repository
Clone the cyclecloud-slurm-container repository from GitHub:
git clone https://github.com/vinil-v/cyclecloud-slurm-container.git
Example output:
[azureuser@cc87 ~]$ git clone https://github.com/vinil-v/cyclecloud-slurm-container.git
Cloning into 'cyclecloud-slurm-container'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 27 (delta 2), reused 27 (delta 2), pack-reused 0
Receiving objects: 100% (27/27), done.
Resolving deltas: 100% (2/2), done.
Step 3: Upload the Project to the CycleCloud Locker
Navigate to the project directory and upload it to the CycleCloud locker:
cd cyclecloud-slurm-container/
cyclecloud project upload <locker name>
Example output:
[azureuser@cc87 cyclecloud-slurm-container]$ cyclecloud project upload "Team Shared-storage"
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support
Job 43b89b46-ca66-3244-6341-d0cb746a87ad has started
Log file is located at: /home/azureuser/.azcopy/43b89b46-ca66-3244-6341-d0cb746a87ad.log
100.0 %, 14 Done, 0 Failed, 0 Pending, 14 Total, 2-sec Throughput (Mb/s): 0.028
Job 43b89b46-ca66-3244-6341-d0cb746a87ad Summary
Files Scanned at Source: 14
Files Scanned at Destination: 14
Elapsed Time (Minutes): 0.0334
Number of Copy Transfers for Files: 14
Number of Copy Transfers for Folder Properties: 0
Total Number of Copy Transfers: 14
Number of Copy Transfers Completed: 14
Number of Copy Transfers Failed: 0
Number of Deletions at Destination: 0
Total Number of Bytes Transferred: 7016
Total Number of Bytes Enumerated: 7016
Final Job Status: Completed
Upload complete!
Step 4: Explore the Project Structure
The project configures Pyxis and Enroot on both the scheduler and compute nodes within a Slurm cluster. It includes the following directories:
[azureuser@cc87 cyclecloud-slurm-container]$ ll specs
total 0
drwxrwxr-x. 3 azureuser azureuser 26 Apr 2 01:44 default
drwxrwxr-x. 3 azureuser azureuser 26 Apr 2 01:44 execute
drwxrwxr-x. 3 azureuser azureuser 26 Apr 2 01:44 scheduler
Compute Node Scripts (execute directory)
These scripts configure NVMe storage, set up Enroot, and enable Pyxis for running container workloads.
[azureuser@cc87 cyclecloud-slurm-container]$ ll specs/execute/cluster-init/scripts/
total 16
-rw-rw-r--. 1 azureuser azureuser 974 Apr 2 01:44 000_nvme-setup.sh
-rw-rw-r--. 1 azureuser azureuser 1733 Apr 2 01:44 001_enroot-setup.sh
-rw-rw-r--. 1 azureuser azureuser 522 Apr 2 01:44 002_pyxis-setup-execute.sh
-rw-rw-r--. 1 azureuser azureuser 350 Apr 2 01:44 README.txt
Scheduler Node Scripts (scheduler directory)
These scripts set up Pyxis on the scheduler node to enable container execution with Slurm.
[azureuser@cc87 cyclecloud-slurm-container]$ ll specs/scheduler/cluster-init/scripts/
total 8
-rw-rw-r--. 1 azureuser azureuser 1266 Apr 2 01:44 000_pyxis-setup-scheduler.sh
-rw-rw-r--. 1 azureuser azureuser 350 Apr 2 01:44 README.txt
Configuring cyclecloud-slurm-container in the CycleCloud Portal
Log in to the CycleCloud web portal and create a Slurm cluster. In the Required Settings, select the HPC VM type Standard_ND96asr_v4 (in this example). In the Advanced Settings, select Ubuntu 22.04 LTS as the OS (the microsoft-dsvm:ubuntu-hpc:2204:latest image includes the GPU drivers, InfiniBand driver, and HPC utilities such as MPI). In your CycleCloud Slurm cluster configuration, add the cyclecloud-slurm-container project as a cluster-init in both the scheduler and execute configurations: click Browse, navigate to the cyclecloud-slurm-container directory, and select the "scheduler" directory for the scheduler and the "execute" directory for execute.
Scheduler cluster-init section:
Execute cluster-init section:
After configuring all the settings, save the changes and start the cluster.
Testing the setup
Once the cluster is running, log in to the scheduler node and create a job script (nccl_benchmark_job.sh) like the one below.
Job script:
#!/bin/bash
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=12
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH -o nccl_allreduce_%j.log
export OMPI_MCA_coll_hcoll_enable=0 \
 NCCL_IB_PCI_RELAXED_ORDERING=1 \
 CUDA_DEVICE_ORDER=PCI_BUS_ID \
 NCCL_SOCKET_IFNAME=eth0 \
 NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
 NCCL_DEBUG=WARN \
 NCCL_MIN_NCHANNELS=32
CONT="mcr.microsoft.com#aznhc/aznhc-nv:latest"
PIN_MASK='ffffff000000,ffffff000000,ffffff,ffffff,ffffff000000000000000000,ffffff000000000000000000,ffffff000000000000,ffffff000000000000'
MOUNT="/opt/microsoft:/opt/microsoft"
srun --mpi=pmix \
 --cpu-bind=mask_cpu:$PIN_MASK \
 --container-image "${CONT}" \
 --container-mounts "${MOUNT}" \
 --ntasks-per-node=8 \
 --cpus-per-task=12 \
 --gpus-per-node=8 \
 --mem=0 \
 bash -c 'export LD_LIBRARY_PATH="/opt/openmpi/lib:$LD_LIBRARY_PATH"; /opt/nccl-tests/build/all_reduce_perf -b 1K -e 16G -f 2 -g 1 -c 0'
Submit the NCCL job using the following command. The -N option sets how many nodes the benchmark runs on; in this example I am running it on 4 nodes. Change -N to the desired number of nodes.
sbatch -N 4 --gres=gpu:8 -p hpc ./nccl_benchmark_job.sh
Output:
azureuser@gpu-scheduler:~$ sbatch -N 4 --gres=gpu:8 -p hpc ./nccl_benchmark_job.sh
Submitted batch job 61
azureuser@gpu-scheduler:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
61 hpc nccl_ben azureuse CF 0:04 4 gpu-hpc-[1-4]
azureuser@gpu-scheduler:~$
Verifying the Results
After the job completes, you will find a nccl_allreduce_<jobid>.log file containing the benchmark details for review.
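A quick note on reading the log: for all-reduce, nccl-tests derives the bus bandwidth (busbw) from the algorithm bandwidth (algbw) as busbw = algbw × 2 × (n − 1) / n, where n is the total number of ranks. With 32 ranks (4 nodes × 8 GPUs) the factor is 1.9375, so the ~96 GB/s algbw at the largest message sizes below corresponds to the ~186 GB/s busbw reported. A minimal, illustrative helper for this conversion:
n=32; algbw=96.28
awk -v n="$n" -v a="$algbw" 'BEGIN { printf "busbw = %.2f GB/s\n", a * 2 * (n - 1) / n }'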
azureuser@gpu-scheduler:~$ cat nccl_allreduce_61.log
pyxis: imported docker image: mcr.microsoft.com#aznhc/aznhc-nv:latest
pyxis: imported docker image: mcr.microsoft.com#aznhc/aznhc-nv:latest
pyxis: imported docker image: mcr.microsoft.com#aznhc/aznhc-nv:latest
pyxis: imported docker image: mcr.microsoft.com#aznhc/aznhc-nv:latest
# nThread 1 nGpus 1 minBytes 1024 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 0 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 16036 on gpu-hpc-1 device 0 [0x00] NVIDIA A100-SXM4-40GB
# Rank 1 Group 0 Pid 16037 on gpu-hpc-1 device 1 [0x00] NVIDIA A100-SXM4-40GB
# Rank 2 Group 0 Pid 16038 on gpu-hpc-1 device 2 [0x00] NVIDIA A100-SXM4-40GB
# Rank 3 Group 0 Pid 16039 on gpu-hpc-1 device 3 [0x00] NVIDIA A100-SXM4-40GB
# Rank 4 Group 0 Pid 16040 on gpu-hpc-1 device 4 [0x00] NVIDIA A100-SXM4-40GB
# Rank 5 Group 0 Pid 16041 on gpu-hpc-1 device 5 [0x00] NVIDIA A100-SXM4-40GB
# Rank 6 Group 0 Pid 16042 on gpu-hpc-1 device 6 [0x00] NVIDIA A100-SXM4-40GB
# Rank 7 Group 0 Pid 16043 on gpu-hpc-1 device 7 [0x00] NVIDIA A100-SXM4-40GB
# Rank 8 Group 0 Pid 17098 on gpu-hpc-2 device 0 [0x00] NVIDIA A100-SXM4-40GB
# Rank 9 Group 0 Pid 17099 on gpu-hpc-2 device 1 [0x00] NVIDIA A100-SXM4-40GB
# Rank 10 Group 0 Pid 17100 on gpu-hpc-2 device 2 [0x00] NVIDIA A100-SXM4-40GB
# Rank 11 Group 0 Pid 17101 on gpu-hpc-2 device 3 [0x00] NVIDIA A100-SXM4-40GB
# Rank 12 Group 0 Pid 17102 on gpu-hpc-2 device 4 [0x00] NVIDIA A100-SXM4-40GB
# Rank 13 Group 0 Pid 17103 on gpu-hpc-2 device 5 [0x00] NVIDIA A100-SXM4-40GB
# Rank 14 Group 0 Pid 17104 on gpu-hpc-2 device 6 [0x00] NVIDIA A100-SXM4-40GB
# Rank 15 Group 0 Pid 17105 on gpu-hpc-2 device 7 [0x00] NVIDIA A100-SXM4-40GB
# Rank 16 Group 0 Pid 17127 on gpu-hpc-3 device 0 [0x00] NVIDIA A100-SXM4-40GB
# Rank 17 Group 0 Pid 17128 on gpu-hpc-3 device 1 [0x00] NVIDIA A100-SXM4-40GB
# Rank 18 Group 0 Pid 17129 on gpu-hpc-3 device 2 [0x00] NVIDIA A100-SXM4-40GB
# Rank 19 Group 0 Pid 17130 on gpu-hpc-3 device 3 [0x00] NVIDIA A100-SXM4-40GB
# Rank 20 Group 0 Pid 17131 on gpu-hpc-3 device 4 [0x00] NVIDIA A100-SXM4-40GB
# Rank 21 Group 0 Pid 17132 on gpu-hpc-3 device 5 [0x00] NVIDIA A100-SXM4-40GB
# Rank 22 Group 0 Pid 17133 on gpu-hpc-3 device 6 [0x00] NVIDIA A100-SXM4-40GB
# Rank 23 Group 0 Pid 17134 on gpu-hpc-3 device 7 [0x00] NVIDIA A100-SXM4-40GB
# Rank 24 Group 0 Pid 17127 on gpu-hpc-4 device 0 [0x00] NVIDIA A100-SXM4-40GB
# Rank 25 Group 0 Pid 17128 on gpu-hpc-4 device 1 [0x00] NVIDIA A100-SXM4-40GB
# Rank 26 Group 0 Pid 17129 on gpu-hpc-4 device 2 [0x00] NVIDIA A100-SXM4-40GB
# Rank 27 Group 0 Pid 17130 on gpu-hpc-4 device 3 [0x00] NVIDIA A100-SXM4-40GB
# Rank 28 Group 0 Pid 17131 on gpu-hpc-4 device 4 [0x00] NVIDIA A100-SXM4-40GB
# Rank 29 Group 0 Pid 17132 on gpu-hpc-4 device 5 [0x00] NVIDIA A100-SXM4-40GB
# Rank 30 Group 0 Pid 17133 on gpu-hpc-4 device 6 [0x00] NVIDIA A100-SXM4-40GB
# Rank 31 Group 0 Pid 17134 on gpu-hpc-4 device 7 [0x00] NVIDIA A100-SXM4-40GB
NCCL version 2.19.3+cuda12.2
#
#                      out-of-place                in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements)                  (us) (GB/s) (GB/s)      (us) (GB/s) (GB/s)
1024 256 float sum -1 53.54 0.02 0.04 N/A 55.41 0.02 0.04 N/A
2048 512 float sum -1 60.53 0.03 0.07 N/A 60.49 0.03 0.07 N/A
4096 1024 float sum -1 61.70 0.07 0.13 N/A 58.78 0.07 0.14 N/A
8192 2048 float sum -1 64.86 0.13 0.24 N/A 59.49 0.14 0.27 N/A
16384 4096 float sum -1 134.2 0.12 0.24 N/A 59.91 0.27 0.53 N/A
32768 8192 float sum -1 66.55 0.49 0.95 N/A 61.85 0.53 1.03 N/A
65536 16384 float sum -1 69.26 0.95 1.83 N/A 64.42 1.02 1.97 N/A
131072 32768 float sum -1 73.87 1.77 3.44 N/A 221.6 0.59 1.15 N/A
262144 65536 float sum -1 360.4 0.73 1.41 N/A 91.51 2.86 5.55 N/A
524288 131072 float sum -1 103.5 5.06 9.81 N/A 101.1 5.18 10.04 N/A
1048576 262144 float sum -1 115.6 9.07 17.57 N/A 118.0 8.89 17.22 N/A
2097152 524288 float sum -1 142.8 14.68 28.45 N/A 141.5 14.82 28.72 N/A
4194304 1048576 float sum -1 184.6 22.72 44.02 N/A 183.8 22.82 44.21 N/A
8388608 2097152 float sum -1 277.2 30.26 58.63 N/A 271.9 30.86 59.78 N/A
16777216 4194304 float sum -1 370.4 45.30 87.77 N/A 377.5 44.45 86.12 N/A
33554432 8388608 float sum -1 632.7 53.03 102.75 N/A 638.8 52.52 101.76 N/A
67108864 16777216 float sum -1 1016.1 66.04 127.96 N/A 1018.5 65.89 127.66 N/A
134217728 33554432 float sum -1 1885.0 71.20 137.96 N/A 1853.3 72.42 140.32 N/A
268435456 67108864 float sum -1 3353.1 80.06 155.11 N/A 3369.3 79.67 154.36 N/A
536870912 134217728 float sum -1 5920.8 90.68 175.68 N/A 5901.4 90.97 176.26 N/A
1073741824 268435456 float sum -1 11510 93.29 180.74 N/A 11733 91.52 177.31 N/A
2147483648 536870912 float sum -1 22712 94.55 183.20 N/A 22742 94.43 182.95 N/A
4294967296 1073741824 float sum -1 45040 95.36 184.76 N/A 44924 95.60 185.23 N/A
8589934592 2147483648 float sum -1 89377 96.11 186.21 N/A 89365 96.12 186.24 N/A
17179869184 4294967296 float sum -1 178432 96.28 186.55 N/A 178378 96.31 186.60 N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 75.0205
#
Conclusion
Integrating containers with CycleCloud-Slurm for multi-node, multi-GPU workloads enables seamless scalability and portability in HPC and AI applications. By leveraging Enroot and Pyxis, we can efficiently execute containerized workloads while ensuring optimal GPU utilization. The cyclecloud-slurm-container project simplifies the deployment process, making it easier for teams to configure, manage, and scale their workloads on Azure HPC clusters. Running NCCL benchmarks inside containers provides valuable insights into communication efficiency across GPUs and nodes, helping optimize AI and deep learning training workflows. By following this guide, you can confidently set up and run containerized multi-node NCCL benchmarks in CycleCloud-Slurm, ensuring peak performance for your AI workloads in the cloud.
References
ND A100 v4 Series – GPU-Accelerated Virtual Machines
Microsoft Azure CycleCloud – Overview
Slurm and Containers – Official Documentation
NVIDIA Pyxis – Slurm Container Runtime
NVIDIA Enroot – Lightweight Container Runtime
Azure HPC VM Images – Preconfigured Images for HPC Workloads
CycleCloud-Slurm-Container – Project Repository
CycleCloud Workspace for Slurm

Azure’s ND GB200 v6 Delivers Record Performance for Inference Workloads
Achieving peak AI performance requires both cutting-edge hardware and a finely optimized infrastructure. Azure's ND GB200 v6 Virtual Machines, accelerated by NVIDIA GB200 Blackwell GPUs, have already demonstrated world-record performance of 865,000 tokens/s for inferencing on the industry-standard LLAMA2 70B.

Unpacking the Performance of Microsoft Azure ND GB200 v6 Virtual Machines
For a comprehensive understanding of our benchmarking methodologies and detailed performance results, please refer to our benchmarking guide available on the official Azure GitHub repository: Azure AI Benchmarking Guide.
Breakdown of Benchmark Tests
GEMM Performance
General Matrix Multiply (GEMM) operations form the backbone of AI models. We measured that more than 60% of the time spent inferencing or training an AI model is spent doing matrix multiplication, so measuring its speed is key to understanding the performance of a GPU-based virtual machine. The Azure benchmark assesses matrix-to-matrix multiplication efficiency using NVIDIA's cuBLASLt library with FP8 precision, ensuring results reflect enterprise AI workloads. We measured the peak theoretical performance of the NVIDIA GB200 Blackwell GPU to be 4,856 TFLOPS, representing a 2.45x increase over the peak theoretical 1,979 TFLOPS of the NVIDIA H100 GPU. This finding is in line with NVIDIA's announcement of a 2.5x performance increase at GTC 2024. The true performance gain of the NVIDIA GB200 Blackwell GPU over its predecessors emerges in real-life conditions. For example, using 10,000 warm-up iterations and randomly initialized matrices demonstrated a sustained 2,744 TFLOPS for FP8 workloads, which, while expectedly lower than the theoretical peak, is still double that of the H100. Based on our early results, these improvements translate to up to a 3x speedup on average for end-to-end training and inference workloads.
High-Bandwidth Memory (HBM) Bandwidth
Memory bandwidth is the metric that governs data movement. Our benchmarks showed a peak memory bandwidth of 7.35 TB/s, achieving 92% of the theoretical peak of 7.9 TB/s. This efficiency mirrors that of the H100, which also operated close to its theoretical maximum, while delivering 2.5x faster data transfers. This speedup ensures that data-intensive tasks, such as training large-scale neural networks, are executed efficiently.
NVBandwidth
The ND GB200 v6 architecture significantly enhances AI workload performance with NVLink C2C, enabling a direct, high-speed connection between the GPU and the host system. This design reduces latency and improves data transfer efficiency, making AI workloads faster and more scalable. Our NVBandwidth tests measured CPU-to-GPU and GPU-to-CPU transfer rates to be nearly 4x faster than the ND H100 v5. This improvement minimizes bottlenecks in data-intensive applications and optimizes data movement efficiency over previous GPU-powered virtual machines. In addition, it allows the GPU to readily access additional host memory when needed via the C2C link.
NCCL Bandwidth
NVIDIA's Collective Communications Library (NCCL) enables high-speed communication between GPUs within and across nodes. We built our tests to measure the speed of communication between GPUs over NVLink within one virtual machine. High-speed communication is instrumental, as most enterprise workloads consist of large-scale distributed models. The ND GB200 v6's NVLink achieved a bandwidth of approximately 680 GB/s, aligning with NVIDIA's projections.
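For readers who want to reproduce an intra-node NVLink measurement of this kind, the typical approach is an nccl-tests all-reduce run confined to a single VM. The sketch below is illustrative only (the binary path and GPU count are assumptions; the Azure AI Benchmarking Guide documents the exact procedure):
GPUS_PER_VM=$(nvidia-smi -L | wc -l)   # number of GPUs visible on this VM
mpirun -np "${GPUS_PER_VM}" --bind-to numa -x NCCL_DEBUG=WARN \
 /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1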
Conclusion
The ND GB200 v6 virtual machine, powered by NVIDIA GB200 Blackwell GPUs, showcases substantial advancements in computational performance, memory bandwidth, and data transfer speeds compared to previous generations of virtual machines. These improvements are pivotal for efficiently managing the increasing demands of AI workloads such as generative and agentic use cases. Following our Benchmarking Guide will provide early access to performance reviews of the innovations announced at GTC 2025, helping customers drive the next wave of AI on Azure's purpose-built AI infrastructure.

Experience Next-Gen HPC Innovation: AMD Lab Empowers ‘Try Before You Buy’ on Azure
In today's fast-paced digital landscape, High-Performance Computing (HPC) is a critical engine powering innovation across industries, from automotive and aerospace to energy and manufacturing. To keep pace with escalating performance demands and the need for agile, risk-free testing environments, AMD has partnered with Microsoft and leading Independent Software Vendors (ISVs) to introduce the AMD HPC Innovation Lab. This pioneering sandbox environment on Azure is a “try before you buy” solution designed to empower customers to run their HPC workloads, assess performance, and experience AMD's newest hardware innovations that deliver enhanced performance, scalability, and consistency, all without any financial commitment.
Introducing the AMD Innovation Lab: A New Paradigm in Customer Engagement
The AMD HPC Innovation Lab represents a paradigm shift in customer engagement for HPC solutions. Traditionally, organizations had to invest significant time and resources to build and manage on-premises testing environments, dealing with challenges such as hardware maintenance, scalability issues, and high operational costs. Without the opportunity to fully explore the benefits of cloud solutions through a trial offer, they often missed out on the advantages of cloud computing. With this innovative lab, customers now have the opportunity to experiment with optimized HPC environments through a simple, user-friendly interface. The process is straightforward: upload your input file or choose from the pre-configured options, run your workload, and then download your output file for analysis. This streamlined approach allows businesses to compare performance results on an apples-to-apples basis against other providers or existing on-premises setups.
Empowering Decision Makers
For Business Decision Makers (BDMs) and Technical Decision Makers (TDMs), the lab offers a compelling value proposition. It eliminates the complexities and uncertainties often associated with traditional testing environments by providing a risk-free opportunity to:
Thoroughly Evaluate Performance: With access to AMD's cutting-edge chipsets and Azure's robust cloud infrastructure, organizations can conduct detailed proof-of-concept evaluations without incurring long-term costs.
Accelerate Decision-Making: The streamlined testing process not only speeds up the evaluation phase but also accelerates the overall time to value, enabling organizations to make informed decisions quickly.
Optimize Infrastructure: Created in partnership with ISVs and optimized by both AMD and Microsoft, the lab ensures that the infrastructure is fine-tuned for HPC workloads. This guarantees that performance assessments are both accurate and reflective of real-world scenarios.
Seamless Integration with Leading ISVs
A notable strength of the AMD HPC Innovation Lab is its collaborative design with top ISVs like Ansys, Altair, Siemens, and others. These partnerships ensure that the lab's environment is equipped with industry-leading applications and solvers, such as Ansys Fluent for fluid dynamics and Ansys Mechanical for structural analysis. Each solver is pre-configured to provide a balanced and consistent performance evaluation, ensuring that users can benchmark their HPC workloads against industry standards with ease.
Sustainability and Scalability
Beyond performance and ease of use, the AMD HPC Innovation Lab is built with sustainability in mind.
By leveraging Azure's scalable cloud infrastructure, businesses can conduct HPC tests without the overhead and environmental impact of maintaining additional on-premises resources. This not only helps reduce operational costs but also supports corporate sustainability goals by minimizing the carbon footprint associated with traditional HPC setups.
An Exciting Future for HPC Testing
The innovation behind the AMD HPC Innovation Lab is just the beginning. With plans to continuously expand the lab catalog and include more ISVs, the platform is set to evolve into a comprehensive testing ecosystem. This ongoing expansion will provide customers with an increasingly diverse set of tools and environments tailored to meet a wide array of HPC needs. Whether you're evaluating performance for fluid dynamics, structural simulations, or electromagnetic fields, the lab's growing catalog promises to deliver precise and actionable insights.
Ready to Experience the Future of HPC?
The AMD HPC Innovation Lab on Azure offers a unique and exciting opportunity for organizations looking to harness the power of advanced computing without upfront financial risk. With its intuitive interface, optimized infrastructure, and robust ecosystem of ISVs, this sandbox environment is a game-changer in HPC testing and validation. Take advantage of this no-cost, high-impact solution to explore, experiment, and experience firsthand the benefits of AMD-powered HPC on Azure. To learn more and sign up for the program, visit https://aka.ms/AMDInnovationLab/LearnMore