8–10 Jul 2026
Europe/Zurich timezone

Operational challenges of the Event Processing Nodes GPU farm at ALICE Experiment

9 Jul 2026, 15:00
15m
EITHER 15 minute talk or 5 minute 'flash' talk
Submitted talks

Speaker

Federico Ronchetti (CERN)

Description

F. Ronchetti (CERN), G. Erba (Goethe U.) on behalf of the ALICE Collaboration
The Event Processing Nodes (EPN) farm of the ALICE Collaboration is a high-density GPU HPC infrastructure designed for real-time reconstruction of Pb–Pb collisions at a 50 kHz interaction rate during CERN LHC Run 3. Comprising 350 nodes and 2800 GPUs and delivering about 42 PFLOP/s of single-precision peak performance, it is the largest computing farm at CERN in terms of compute capacity. Beyond raw performance, the EPN has been conceived and operated with sustainability as a central design principle, addressing energy efficiency, resource optimization, hardware longevity, and reduced operational overhead.
This contribution focuses on the architectural and organizational choices that have enabled a sustainable operation model for a physics-critical HPC facility maintained continuously by a small, dedicated team. Sustainability is addressed holistically, from compute efficiency to infrastructure services.
The talk discusses how sustainable design principles spanning cooling, power, hardware utilization, and automation have enabled continuous 24/7 operation of a large-scale GPU farm with low manpower requirements and reduced operational overhead.
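
As a rough illustration of the scale quoted in the abstract, the sketch below derives per-node and per-GPU figures from the three stated numbers (350 nodes, 2800 GPUs, about 42 PFLOP/s single-precision peak). It assumes GPUs are distributed evenly across nodes and that the peak figure is attributable to the GPUs alone. Neither assumption is stated in the abstract, and the function name is hypothetical.

```python
# Figures quoted in the abstract.
NODES = 350
GPUS = 2800
PEAK_PFLOPS_FP32 = 42.0  # "about 42 PFLOP/s", so derived values are approximate


def derived_figures(nodes=NODES, gpus=GPUS, peak_pflops=PEAK_PFLOPS_FP32):
    """Derive per-node and per-GPU figures from the farm totals.

    Assumes (not stated in the abstract) an even GPU distribution across
    nodes and that the peak performance comes entirely from the GPUs.
    """
    peak_tflops = peak_pflops * 1000.0  # 1 PFLOP/s = 1000 TFLOP/s
    return {
        "gpus_per_node": gpus / nodes,
        "tflops_per_gpu": peak_tflops / gpus,
        "tflops_per_node": peak_tflops / nodes,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the quoted totals correspond to 8 GPUs per node, about 15 TFLOP/s of single-precision peak per GPU, and about 120 TFLOP/s per node.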
