Weekly meeting: green compute team
Live notes: see the Teams channel / available on demand for non-UofM participants.
Zoom link:
https://cern.zoom.us/j/69108649411?pwd=BhqU0RERtnPf2gtK872m4gSM6izuZx.1
In this meeting, we will discuss the status and next steps for the benchmarking project with HEPScore/HEPBenchmarks and the joint work with Glasgow.
UofM green compute meeting
Round table
Caterina Doglioni - Professor, looking for dark matter at the ATLAS experiment @ LHC; works on software efficiency and sustainability, both online (trigger) and offline.
Michael Sparks - senior RSE, improving software with lots of software engineering background
Sakshi Kumar - IITP Bangalore, electrical engineering; internships on open-source contributions; GSoC participant.
Emanuele Simili - Glasgow, considering power efficiency and measuring the performance of various hardware, using HEPScore and estimating performance/watt. Has developed a prototype power accounting system at Glasgow; a baseline set of scripts exists.
Alessandra Forti - responsible for the Tier-2, working in ATLAS operations for WLCG; interested in progress on environmental computing.
Tobias Fitschen - postdoc at UofM, working on the pump-prime project (seed funding) measuring the energy consumption of software on the local cluster Noether.
Domenico Giordano - CERN IT, benchmarking on WLCG and the HEPScore benchmark; collects measurements and injects these benchmarks into the workload management systems of the experiments. Sustainability aspect: power measurements could be added and correlated with the load/power of the machine, connecting different measurements and enabling studies through data analysis.
Robert Frank - responsible for the UofM Tier-2 and Tier-3; helping the pump-prime project with hardware information and power usage.
[Joined later] Sam Skipsey - Glasgow, WLCG expert
%%%
Emanuele - state of the art for Glasgow
Confluence page:
https://gridpp.atlassian.net/wiki/spaces/public/pages/250281986/GridPP+Sustainability+Page
A node exporter runs on every node and provides data.
Collector: Prometheus/VictoriaMetrics collects the metrics.
Plotter: Grafana.
Exporter: this part scrapes the metrics and is site-specific; it may have to be customised because some sites use e.g. ElasticSearch.
- There are a few scripts to collect information, e.g. from IPMI (which needs to be run as root).
- The scripts are run as cron jobs and export the relevant metrics (see the sketch after this list).
- The files are then read by node_exporter.
VO information is also parsed; it is currently hardcoded.
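A minimal sketch of what such a cron-run exporter script could look like, assuming ipmitool's DCMI power reading and node_exporter's textfile collector; the output path and metric name are hypothetical, not the actual Glasgow scripts:

```python
# Hedged sketch of a cron-run exporter: read the node power draw via IPMI
# (needs root) and write it in Prometheus text format to a file that
# node_exporter's textfile collector can pick up.
import re
import subprocess

OUT = "/var/lib/node_exporter/textfile/power.prom"  # hypothetical path

def ipmi_power_watts() -> float:
    # "ipmitool dcmi power reading" prints a line like:
    #   Instantaneous power reading:    212 Watts
    out = subprocess.run(["ipmitool", "dcmi", "power", "reading"],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"Instantaneous power reading:\s+(\d+)\s+Watts", out)
    if match is None:
        raise RuntimeError("unexpected ipmitool output")
    return float(match.group(1))

watts = ipmi_power_watts()
with open(OUT, "w") as f:
    f.write("# HELP node_ipmi_power_watts Node power draw from IPMI DCMI.\n")
    f.write("# TYPE node_ipmi_power_watts gauge\n")
    f.write(f"node_ipmi_power_watts {watts}\n")
```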
Collector: puts the information together in an application that can be queried.
- Prometheus stores 3 months of metrics, VictoriaMetrics 1 year. We can then run queries, e.g. to get the power usage of a group (at different granularities).
- Metrics can be aggregated, e.g. core usage per VO.
- Validation of the aggregation per node group is difficult (e.g. comparison with the overall IPMI reading) and site-specific.
Then things are plotted via Grafana.
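To illustrate the kind of query the collector enables (e.g. power usage by group), a hedged sketch against the standard Prometheus HTTP API; the metric and the vo label are assumptions carried over from the exporter sketch above:

```python
# Hedged sketch: query the Prometheus HTTP API for power aggregated by VO.
# The /api/v1/query endpoint is standard Prometheus; the metric name and
# the "vo" label are hypothetical.
import requests

PROM = "http://prometheus.example.org:9090"  # hypothetical collector URL

# Sum the instantaneous power gauge over all nodes, grouped by VO label.
query = "sum by (vo) (node_ipmi_power_watts)"
resp = requests.get(f"{PROM}/api/v1/query", params={"query": query})
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    vo = sample["metric"].get("vo", "unknown")
    watts = float(sample["value"][1])
    print(f"{vo}: {watts:.0f} W")
```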
Domenico - state of the art for HEPScore/HEPBenchmarks
HEPiX benchmarking WG on 14/05 [indico link: 1546602]
- Meets every month; today at 4 pm UK: GPU workflows.
Collection of power consumption data can be site-specific.
If hardware power metrics are available to the batch system (e.g. via pilot jobs), then one can make measurements.
- EGI presented a similar project (GreenDigit)
Pilot approach: collect power consumption through a probe job
We don't want to create an accounting system; we want to show that this is possible: submit a job with a benchmark probe to enable a number of studies (including discovering misconfigurations).
What we do:
- Have a script collecting power usage (in any way one wants) and put its output in a place where it can be read.
- The benchmarking suite reads this information.
- This has been tried at three different sites.
Basic concept: power draw to power log (a minimal sketch is below).
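A minimal sketch of the power-draw-to-power-log idea, under the assumption of a plain timestamped text format; the path, format and interval are hypothetical, not what HEPBenchmarks actually expects:

```python
# Hedged sketch of "power draw to power log": periodically sample the power
# draw (placeholder function) and append timestamped readings to a plain
# text log. Path, format and interval are assumptions.
import time

LOG = "/tmp/power.log"  # hypothetical location readable by the suite
INTERVAL_S = 10         # hypothetical sampling interval

def sample_power_watts() -> float:
    # Placeholder: in practice this would call IPMI, read the PSU, etc.
    return 200.0  # dummy value for the sketch

while True:
    ts = int(time.time())
    with open(LOG, "a") as f:
        f.write(f"{ts} {sample_power_watts():.1f}\n")
    time.sleep(INTERVAL_S)
```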
We should have a place where we can put all these scripts, and then the HEPScore team will include them in the HammerCloud jobs.
Machine feature definition: see https://hepsoftwarefoundation.org/notes/HSF-TN-2016-02.pdf
Alessandra: you want to get the information independently?
Domenico: in the instructions here
https://w3.hepix.org/benchmarking/how_to_run_HS23.html (see GitHub)
You execute the command together with the script (and it is IPMI). Instead of this, you can take the information from a file (see the sketch below).
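Continuing the hypothetical log format from the sketch above, reading the information back from a file rather than executing the command might look like:

```python
# Hedged sketch: read the most recent sample back from the hypothetical
# power log written above, instead of executing the IPMI command directly.
def latest_power_watts(path: str = "/tmp/power.log") -> float:
    with open(path) as f:
        last_line = f.readlines()[-1]  # "<unix_ts> <watts>"
    _ts, watts = last_line.split()
    return float(watts)

print(latest_power_watts())
```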
Michael: for UofM, there are scripts to extract power information between two timestamps from the power supplies, already exported to CSV files (10-second intervals, can be longer).
Domenico: one can do something fancier, e.g. differencing over a time window or averaging. This is needed because we want a rolling average to avoid spikes; it gives more accurate instantaneous values (a sketch follows).
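A hedged sketch of such a rolling average over the UofM CSV data; the file name, column names, window length and timestamps are assumptions about the export format:

```python
# Hedged sketch: rolling average of PSU power readings from a CSV of
# 10-second samples, to smooth out spikes. Column names, file path and
# the 60 s window are assumptions about the UofM export format.
import pandas as pd

df = pd.read_csv("power_readings.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp").sort_index()

# Time-based rolling mean: a smoother, more representative "instant" value.
df["power_watts_smoothed"] = df["power_watts"].rolling("60s").mean()

# Average power between two (hypothetical) timestamps:
window = df.loc["2025-06-04 10:00":"2025-06-04 11:00", "power_watts"]
print(f"mean power in window: {window.mean():.1f} W")
```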
Ideally we want this information in /proc, so that it can be accessed in the same way from different systems. That needs a kernel module and a service; we can easily do this kind of thing in our community. It is a bit more complex than having a cron job.
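For comparison, Linux already exposes one power counter through a kernel interface of this kind: the powercap (RAPL) sysfs files. The sketch below reads it; note that RAPL covers only the CPU package, not the whole node, so this is an analogy for the proposed interface rather than a replacement:

```python
# Hedged sketch: read CPU package energy from the Linux powercap (RAPL)
# sysfs interface (usually needs root on recent kernels). RAPL covers only
# the CPU package, not whole-node power.
import time

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0, microjoules

def read_energy_uj() -> int:
    with open(RAPL) as f:
        return int(f.read())

e0, t0 = read_energy_uj(), time.time()
time.sleep(5)
e1, t1 = read_energy_uj(), time.time()
# energy_uj is a wrapping counter; wrap-around is ignored in this sketch.
watts = (e1 - e0) / 1e6 / (t1 - t0)
print(f"CPU package power: {watts:.1f} W")
```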
%%%
Two directions for UofM work:
- Work with Emanuele on power accounting, try to make it UofM-specific as well, and work on plotting.
- Provide the information that HEPScore needs
Michael: what do you get out of HEPScore, in text-file format, from all sites? Then we can adapt.
Domenico: we need to see different cases of different hardware first; this is why this work is useful.
The accounting part from Emanuele is really valuable; it will require the involvement of some accounting or reporting system (AUDITOR, EGI, experiment…). Trying things out to prove to the community that this is doable is important, but we need to integrate with the accounting systems, and these are sometimes different. We need to understand these as well.
We can do power consumption measurements locally, but at some point this code will need to evolve into something that integrates with existing tools that do the accounting.
This work allows collaboration with DESY/Glasgow, also proves common interest in sustainability.
On the topic of user activities for GPU workloads: we create a new HEPScore configuration (not the official one) that calls a new workload. We need to have that workload in a container. This is feasible.
Domenico: working together with Glasgow: work on the node exporter, and refine the metrics and aggregation in Prometheus so that they can be used by others as well. We need to understand the needs of specific sites and then generalise. This is the reason why we have declarative commands: we know that not everyone will have the same command.
Next steps:
- Goal: have UofM as another pilot site for HEPBenchmarks.
- Michael and Sakshi will start working on refining the Prometheus+PS scripts.
  - A starting/common point can be Emanuele's node exporter and collector (aggregator) scripts.
  - This work will include improvements that Emanuele would like to see on the collector side of the Glasgow accounting; we will be in contact via e-mail to set up a wishlist of what is useful.
- Emanuele should share:
  - the file (and the knowledge needed to make it) that is input to the current version of HEPBenchmarks, as a starting point;
  - a list of potential improvements he'd like to see in the aggregator scripts (e.g. averaging, if not there yet?) so Michael and Sakshi can also work on that.
- We all start attending the HEPiX benchmarking meetings (the first one is today, on GPU workflows: https://indico.cern.ch/event/1555093/).
We will have a check-in on the work done in two weeks' time, on Wednesday the 18th at 2 pm CERN / 1 pm UK (Caterina and Tobias will send out the agenda and Zoom link).