SHAREing Performance Monitoring of Accelerated Compute Workshop

Europe/London
Computational Foundry 101 (hybrid: Zoom, or in person at Swansea University, Bay Campus)

Computational Foundry, Bay Campus, Fabian Way, Swansea SA1 8EN
Description

The SHAREing Accelerated Compute Hub aims to share knowledge and skills relating to running, programming, and utilising shared accelerated compute platforms.

A key gap we have identified in the system-administration knowledge space is understanding how to monitor the utilisation, efficiency, and performance of software on accelerated compute. There is a plethora of tools that can be run as part of a user job to profile its efficiency, and many more that assess the energy consumption of a cluster as a whole or the long-term utilisation of individual accelerators. However, there is no clear set of best practices for holistically connecting systems-level data with user jobs, so that users can monitor, and be alerted to, their resource utilisation without needing to explicitly profile each job.

This workshop brings together experts from hardware and software vendors and HPC centres to share experiences and best practices on how to connect hardware data to user workloads.

Confirmed speakers include:

  • Jordà Polo, AMD
  • Jan Eitzinger and Christoph Kluge, NHR@FAU (Cluster Cockpit)
  • Lee Davis, NVIDIA
  • Mahendra Paipuri, CNRS (CEEMS)
  • Rudy Shand, Linaro

We welcome both in-person and remote participation in this event. The Zoom link to participate remotely will be sent to registered participants in advance of the event.

Timetable
    • 10:00–10:30
      Arrivals (30m)
    • 10:30–11:00
      Workshop: Welcome and introduction
      • 10:30
        Welcome and introduction (20m)
        Speaker: Ed Bennett (Swansea University)
    • 11:00–12:15
      Workshop: Current practice: presentations from HPC centres and discussion
    • 12:15–13:15
      Lunch (1h)
    • 13:15–14:30
      Workshop: What tools are available? Presentations from tooling authors and vendors
      • 13:15
        ClusterCockpit: a job-specific performance and energy monitoring and optimization framework for HPC clusters (25m)
        Speaker: Jan Eitzinger (NHR@FAU)
      • 13:40
        CEEMS: A Resource Manager Agnostic Energy, Emissions & Performance Monitoring Stack (25m)

        With the rapid acceleration of ML/AI research over the last couple of years, already energy-hungry HPC platforms have become even more demanding. A major part of this energy consumption comes from users' workloads, and it is only with the participation of end users that the overall energy consumption of these platforms can be reduced. However, most HPC platforms provide neither energy-consumption nor performance metrics out of the box, which in turn does not encourage end users to optimize their workloads.

        The Compute Energy & Emissions Monitoring Stack (CEEMS) has been designed to address this issue. CEEMS can report the energy consumption and equivalent emissions of user workloads in real time on SLURM (HPC), OpenStack (cloud), and Kubernetes platforms alike. It leverages the Linux perf subsystem and eBPF to monitor application performance metrics, helping end users rapidly identify bottlenecks in their workflows and optimize them to reduce their energy and carbon footprint. CEEMS supports eBPF-based continuous profiling, and is the first monitoring stack to do so on HPC platforms. Another advantage of CEEMS is that it monitors every job on the platform systematically, without end users having to modify their workflows or code.

        Besides CPU energy usage, CEEMS reports the energy usage and performance metrics of workloads on NVIDIA and AMD GPU accelerators. It is built around prominent open-source tools from the observability ecosystem, such as Prometheus and Grafana. CEEMS is designed to be extensible, allowing HPC centre operators to easily define energy-estimation rules for user workloads based on the underlying hardware. It monitors I/O and network metrics in a file-system-agnostic manner, allowing it to work with any parallel file system used by HPC platforms.

        Speaker: Mahendra Paipuri (CNRS)
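        The Prometheus-plus-Grafana architecture described in this abstract can be illustrated with a minimal configuration sketch. This is a hypothetical example, not taken from the CEEMS documentation: the exporter targets, port, metric name, and label are invented for illustration.

        ```yaml
        # prometheus.yml (fragment) -- scrape a hypothetical per-node
        # energy/performance exporter every 30 seconds.
        scrape_configs:
          - job_name: "node-energy"
            scrape_interval: 30s
            static_configs:
              - targets: ["node001:9010", "node002:9010"]   # invented host:port

        # rules.yml (fragment) -- aggregate a hypothetical per-job power
        # gauge (watts) into average power per SLURM job ID, so dashboards
        # can show per-job figures without users instrumenting anything.
        groups:
          - name: job-energy
            rules:
              - record: job:power_watts:avg
                expr: avg by (slurm_job_id) (node_job_power_watts)
        ```

        In a real deployment, series like these would be visualised in Grafana and integrated over a job's runtime to obtain energy in joules or kWh.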
      • 14:05
        GPU Performance Monitoring with Linaro Forge: An Application-Centric Approach and Future System-Wide Capabilities (25m)

        This talk will explore the Linaro Forge tool suite's capabilities for performance monitoring in accelerated compute environments, focusing on GPU utilization. We will present a lightweight profiler that allows application developers to gather essential, high-fidelity metrics offline using Linaro Performance Reports, and enables deeper insight into performance issues using Linaro MAP. These tools have minimal impact on application runtime, and their design makes them easy to integrate into scheduling scripts.

        While Linaro Forge tools are primarily application-developer focused, we will address the system-wide perspective by discussing how they could be used to provide visibility into job-level GPU metrics. We will also share and seek feedback on speculative future product ideas aimed at extending our tools to deliver a broader, sysadmin-focused view of GPU occupation, utilization, and energy consumption.

        Speaker: Rudy Shand (Linaro)
    • 14:30–15:00
      Coffee break (30m)
    • 15:00–15:50
      Workshop: What tools are available? Presentations from tooling authors and vendors (continued)
    • 15:50–16:45
      Workshop: Where do we go next? Reflective discussions and hackathon ideation