Weekly meeting: green compute team
Live notes: see the Teams channel / available on demand for non-UofM participants.
Zoom link:
https://cern.zoom.us/j/69108649411?pwd=BhqU0RERtnPf2gtK872m4gSM6izuZx.1
In this meeting, we will discuss the status and next steps for the benchmarking project with HEPScore/HEPBenchmarks and the joint work with Glasgow.
UofM green compute meeting
Round table
Caterina Doglioni - Professor, looking for dark matter at the ATLAS experiment @ LHC; works on software efficiency and sustainability, both online (trigger) and offline.
Michael Sparks - senior RSE, improving software with lots of software engineering background
Sakshi Kumar - IITP Bangalore, electrical engineering; internships on open-source contributions; GSoC participant.
Emanuele Simili - Glasgow, considering power efficiency and measuring the performance of various hardware, using HEPScore and estimating performance/watt. Has developed a prototype power accounting system at Glasgow; a baseline set of scripts exists.
Alessandra Forti - responsible for the Tier-2, working in ATLAS operations for WLCG; interested in progress on environmental computing.
Tobias Fitschen - postdoc at UofM, working on the pump-prime project (seed funding) measuring the energy consumption of software on the local cluster Noether.
Domenico Giordano - CERN IT, benchmarking on WLCG and the HEPScore benchmark; collects measurements and injects these benchmarks into the workload management systems of the experiments. Sustainability aspect: power measurements could be added and correlated with the load/power of the machine, connecting different measurements and enabling studies through data analysis.
Robert Frank - responsible for the UofM Tier-2 and Tier-3; helping the pump-prime project with hardware information and power usage.
[Joined later] Sam Skipsey - Glasgow, WLCG expert
%%%
Emanuele - state of the art for Glasgow
Confluence page:
https://gridpp.atlassian.net/wiki/spaces/public/pages/250281986/GridPP+Sustainability+Page
A node exporter runs on every node and provides data.
Collector: Prometheus/VictoriaMetrics collects the metrics.
Plotter: Grafana.
Exporter: this part scrapes the metrics and is site-specific; it may have to be customised because some sites use e.g. ElasticSearch.
- There are a few scripts to collect information, e.g. from IPMI (which needs to be run as root).
- The scripts are run as cron jobs and export the relevant metrics (see the sketch after this list).
- The files are then read by node_exporter.
VO information is also parsed; it is currently hardcoded.
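A minimal sketch of what such a cron-run exporter script could look like, assuming ipmitool's DCMI power reading and node_exporter's textfile collector; the output path and metric name are hypothetical, not the actual Glasgow scripts:

```python
# Hedged sketch of a cron-run exporter: read the node power draw via IPMI
# (needs root) and write it in Prometheus text format to a file that
# node_exporter's textfile collector can pick up.
import re
import subprocess

OUT = "/var/lib/node_exporter/textfile/power.prom"  # hypothetical path

def ipmi_power_watts() -> float:
    # "ipmitool dcmi power reading" prints a line like:
    #   Instantaneous power reading:    212 Watts
    out = subprocess.run(["ipmitool", "dcmi", "power", "reading"],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"Instantaneous power reading:\s+(\d+)\s+Watts", out)
    if match is None:
        raise RuntimeError("unexpected ipmitool output")
    return float(match.group(1))

watts = ipmi_power_watts()
with open(OUT, "w") as f:
    f.write("# HELP node_ipmi_power_watts Node power draw from IPMI DCMI.\n")
    f.write("# TYPE node_ipmi_power_watts gauge\n")
    f.write(f"node_ipmi_power_watts {watts}\n")
```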
Collector: puts the information together in an application that can be queried.
- Prometheus stores 3 months of metrics, VictoriaMetrics 1 year. We can then run queries, e.g. to get the power usage of a group (at different granularities).
- Metrics can be aggregated, e.g. core usage per VO.
- Validation of the aggregation per node group is difficult (e.g. comparison with the overall IPMI reading) and site-specific.
Then things are plotted via Grafana.
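To illustrate the kind of query the collector enables (e.g. power usage by group), a hedged sketch against the standard Prometheus HTTP API; the metric and the vo label are assumptions carried over from the exporter sketch above:

```python
# Hedged sketch: query the Prometheus HTTP API for power aggregated by VO.
# The /api/v1/query endpoint is standard Prometheus; the metric name and
# the "vo" label are hypothetical.
import requests

PROM = "http://prometheus.example.org:9090"  # hypothetical collector URL

# Sum the instantaneous power gauge over all nodes, grouped by VO label.
query = "sum by (vo) (node_ipmi_power_watts)"
resp = requests.get(f"{PROM}/api/v1/query", params={"query": query})
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    vo = sample["metric"].get("vo", "unknown")
    watts = float(sample["value"][1])
    print(f"{vo}: {watts:.0f} W")
```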
Domenico - state of the art for HEPScore/HEPBenchmarks
HEPiX benchmarking WG on 14/05 [indico link: 1546602]
- Meets every month; today at 4 pm UK: GPU workflows.
Collection of power consumption data can be site-specific.
If hardware power metrics are available to the batch system (e.g. via pilot jobs), then one can make measurements.
- EGI presented a similar project (GreenDigit)
Pilot approach: collect power consumption through a probe job
We don't want to create an accounting system; we want to show that this is possible: submit a job with a benchmark probe to enable a number of studies (including discovering misconfigurations).
What we do:
- Have a script collecting power usage (in any way one wants) and put its output in a place where it can be read.
- The benchmarking suite reads this information.
- This has been tried at three different sites.
Basic concept: power draw to power log (a minimal sketch is below).
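A minimal sketch of the power-draw-to-power-log idea, under the assumption of a plain timestamped text format; the path, format and interval are hypothetical, not what HEPBenchmarks actually expects:

```python
# Hedged sketch of "power draw to power log": periodically sample the power
# draw (placeholder function) and append timestamped readings to a plain
# text log. Path, format and interval are assumptions.
import time

LOG = "/tmp/power.log"  # hypothetical location readable by the suite
INTERVAL_S = 10         # hypothetical sampling interval

def sample_power_watts() -> float:
    # Placeholder: in practice this would call IPMI, read the PSU, etc.
    return 200.0  # dummy value for the sketch

while True:
    ts = int(time.time())
    with open(LOG, "a") as f:
        f.write(f"{ts} {sample_power_watts():.1f}\n")
    time.sleep(INTERVAL_S)
```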
We should have a place where we can put all these scripts, and then the HEPScore team will include them in the HammerCloud jobs.
Machine feature definition: see https://hepsoftwarefoundation.org/notes/HSF-TN-2016-02.pdf
Alessandra: you want to get the information independently?
Domenico: in the instructions here
https://w3.hepix.org/benchmarking/how_to_run_HS23.html (see GitHub)
You execute the command together with the script (and it is IPMI). Instead of this, you can take the information from a file (see the sketch below).
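Continuing the hypothetical log format from the sketch above, reading the information back from a file rather than executing the command might look like:

```python
# Hedged sketch: read the most recent sample back from the hypothetical
# power log written above, instead of executing the IPMI command directly.
def latest_power_watts(path: str = "/tmp/power.log") -> float:
    with open(path) as f:
        last_line = f.readlines()[-1]  # "<unix_ts> <watts>"
    _ts, watts = last_line.split()
    return float(watts)

print(latest_power_watts())
```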
Michael: for UofM, there are scripts to extract power information between two timestamps from the power supplies, already exported to CSV files (10-second intervals, can be longer).
Domenico: one can do something fancier, e.g. differencing over a time window or averaging. This is needed because we want a rolling average to avoid spikes; it gives more accurate instantaneous values (a sketch follows).
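A hedged sketch of such a rolling average over the UofM CSV data; the file name, column names, window length and timestamps are assumptions about the export format:

```python
# Hedged sketch: rolling average of PSU power readings from a CSV of
# 10-second samples, to smooth out spikes. Column names, file path and
# the 60 s window are assumptions about the UofM export format.
import pandas as pd

df = pd.read_csv("power_readings.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp").sort_index()

# Time-based rolling mean: a smoother, more representative "instant" value.
df["power_watts_smoothed"] = df["power_watts"].rolling("60s").mean()

# Average power between two (hypothetical) timestamps:
window = df.loc["2025-06-04 10:00":"2025-06-04 11:00", "power_watts"]
print(f"mean power in window: {window.mean():.1f} W")
```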
Ideally we want this information in /proc, so that it can be accessed in the same way from different systems. That needs a kernel module and a service; we can easily do this kind of thing in our community. It is a bit more complex than having a cron job.
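For comparison, Linux already exposes one power counter through a kernel interface of this kind: the powercap (RAPL) sysfs files. The sketch below reads it; note that RAPL covers only the CPU package, not the whole node, so this is an analogy for the proposed interface rather than a replacement:

```python
# Hedged sketch: read CPU package energy from the Linux powercap (RAPL)
# sysfs interface (usually needs root on recent kernels). RAPL covers only
# the CPU package, not whole-node power.
import time

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0, microjoules

def read_energy_uj() -> int:
    with open(RAPL) as f:
        return int(f.read())

e0, t0 = read_energy_uj(), time.time()
time.sleep(5)
e1, t1 = read_energy_uj(), time.time()
# energy_uj is a wrapping counter; wrap-around is ignored in this sketch.
watts = (e1 - e0) / 1e6 / (t1 - t0)
print(f"CPU package power: {watts:.1f} W")
```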
%%%
Two directions for UofM work:
- Work with Emanuele on power accounting, try to make it UofM-specific as well, and work on plotting.
- Provide the information that HEPScore needs
Michael: what do you get out of HEPScore, in text-file format, from all sites? Then we can adapt.
Domenico: we need to see different cases of different hardware first; this is why this work is useful.
The accounting part from Emanuele is really valuable; it will require the involvement of some accounting or reporting system (AUDITOR, EGI, experiment…). Trying things out to prove to the community that this is doable is important, but we need to integrate with the accounting systems, and these are sometimes different. We need to understand these as well.
We can do power consumption measurements locally, but at some point this code will need to evolve into something that integrates with existing tools that do the accounting.
This work allows collaboration with DESY/Glasgow, also proves common interest in sustainability.
On the topic of user activities for GPU workloads: we create a new HEPScore configuration (not the official one) that calls a new workload. We need to have that workload in a container. This is feasible.
Domenico: working together with Glasgow: work on the node exporter, and refine the metrics and aggregation in Prometheus so that they can be used by others as well. We need to understand the needs of specific sites and then generalise. This is the reason why we have declarative commands: we know that not everyone will have the same command.
Next steps:
- Goal: have UofM as another pilot site for HEPBenchmarks.
- Michael and Sakshi will start working on refining the Prometheus+PS scripts.
  - A starting/common point can be Emanuele's node exporter and collector (aggregator) scripts.
  - This work will include improvements that Emanuele would like to see on the collector side of the Glasgow accounting; we will be in contact via e-mail to set up a wishlist of what is useful.
- Emanuele should share:
  - the file (and the knowledge needed to make it) that is input to the current version of HEPBenchmarks, as a starting point;
  - a list of potential improvements he'd like to see in the aggregator scripts (e.g. averaging, if not there yet?) so Michael and Sakshi can also work on that.
- We all start attending the HEPiX benchmarking meetings (the first one is today, on GPU workflows: https://indico.cern.ch/event/1555093/).
We will have a check-in on the work done in two weeks' time, on Wednesday the 18th at 2 pm CERN / 1 pm UK (Caterina and Tobias will send out the agenda and Zoom link).