# Software based readout driver evolution towards 1 MHz readout as part of the ATLAS HL-LHC upgrade

Serguei Kolos, University of California Irvine, USA



on behalf of the ATLAS TDAQ Collaboration

# 1. LHC Performance and ATLAS Evolution


ATLAS is one of the four major LHC experiments.

ATLAS is the largest detector ever constructed for a particle collider: 44 meters long and 25 meters in diameter.

More than 100 million sensitive electronics channels are used to record the particles produced by LHC collisions.



# 2. ATLAS Trigger/DAQ System Evolution




#### LHC Performance evolution

| Run    | Period      | Energy [TeV]    | Peak Luminosity [10<sup>34</sup> cm<sup>-2</sup> s<sup>-1</sup>]     | Peak Pileup    |
|--------|-------------|-----------------|----------------------------------------------------------------------|----------------|
| Run 1  | 2009 - 2013 | 7 - 8           | 0.7                                                                  | 35             |
| Run 2  | 2015 - 2018 | 13              | 2                                                                    | 60             |
| Run 3  | 2022 - 2025 | 13.6            | 2                                                                    | 60             |
| Run 4+ | 2029 -      | 13.6 - 14       | 5 - 7.5                                                              | 140 - 200      |

[Figure: the ATLAS detector. Labels: toroid magnets, transition radiation tracker, muon chambers, semiconductor tracker.]

The evolution of the ATLAS Trigger/DAQ (TDAQ) system is mainly driven by the evolution of LHC performance.

The High-Luminosity LHC upgrade after Run 3 will require a major upgrade of the ATLAS TDAQ system.




# 3. FELIX & SW ROD Readout for Run 3 & 4

The new readout system is based on a custom PCIe card called FELIX.







### Run 1 & 2

Readout Drivers (RODs) provide interface between Front-End (FE) and DAQ:

- VME boards developed and maintained by detectors
- Connected via point-to-point optical links to custom PCI/PCIe I/O cards (ROBIN/RobinNP)
- I/O cards are hosted by Readout System (ROS) commodity computers
- ROSes transfer data to the High-Level Trigger (HLT) farm via a commodity switched network




### Run 3

ATLAS uses a mixture of the legacy and new **FELIX**-based readout systems:

- **FELIX** is used to read out the Muon New Small Wheel detector, the upgraded Barrel RPCs, the new Liquid Argon calorimeter digital readout, and the Level 1 calorimeter trigger.
- A new component, known as the **Software Readout Driver (SW ROD)**, has been developed:
  - Receives data from FELIX
  - Supports the legacy HLT interface



### Run 4

The new readout architecture is based on the **FELIX** system:

- The new **Data Handler** is an evolution of the **SW ROD**
- The **Data Handler** has the same functional requirements as the **SW ROD**
- Performance requirements are substantially higher than for Run 3:
  - 1 MHz L1 rate (10x)
  - 4.6 TB/s data readout rate (20x)
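
The scale factors quoted above imply the corresponding Run 3 figures and the average amount of data read out per Level-1 accept. The following is a minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes) and treating the 10x and 20x factors as exact; the variable names are illustrative only.

```python
# Back-of-the-envelope arithmetic from the Run 4 figures quoted above.
# Assumptions: decimal units, and the 10x / 20x factors taken as exact.
RUN4_L1_RATE_HZ = 1_000_000          # 1 MHz L1 rate
RUN4_READOUT_BYTES_PER_S = 4.6e12    # 4.6 TB/s data readout rate

run3_l1_rate_hz = RUN4_L1_RATE_HZ / 10
run3_readout_bytes_per_s = RUN4_READOUT_BYTES_PER_S / 20
run4_bytes_per_event = RUN4_READOUT_BYTES_PER_S / RUN4_L1_RATE_HZ

print(f"Implied Run 3 L1 rate:      {run3_l1_rate_hz / 1e3:.0f} kHz")
print(f"Implied Run 3 readout rate: {run3_readout_bytes_per_s / 1e9:.0f} GB/s")
print(f"Run 4 data per L1 accept:   {run4_bytes_per_event / 1e6:.1f} MB")
```

Because the data rate grows twice as fast as the trigger rate, the implied data volume per accepted event roughly doubles between Run 3 and Run 4.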

# 4. SW ROD Event Building Algorithm Performance for Run 3





The Run 3 version of the FELIX I/O card is a custom PCIe board with a Gen 3 x16 interface, installed in a commodity computer:

- Up to 48 optical input links
- Can be operated in several modes

**GBT** Mode:

- 4.8 Gb/s per link input rate
- Each link can be split into multiple logical sub-links (E-Links)
- Up to 192 virtual E-Links per card

**FULL** Mode:

- Up to 9.6 Gb/s per link input rate
- No virtual link subdivision
- 12 links at full speed for Run 3
- 24 links at full speed for Run 4
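
The FULL-mode link counts can be put next to the host interface bandwidth with a short calculation. This is a sketch under stated assumptions: the 9.6 Gb/s per-link rate and the link counts come from the text above, while the PCIe Gen 3 x16 figure is the generic theoretical maximum (8 GT/s per lane with 128b/130b encoding), not a measured FELIX number.

```python
# Aggregate FULL-mode optical input rate per FELIX card, from the figures above.
FULL_MODE_LINK_GBPS = 9.6


def aggregate_gbps(n_links: int, link_gbps: float = FULL_MODE_LINK_GBPS) -> float:
    """Total input line rate for n_links links running at full speed."""
    return n_links * link_gbps


# Generic theoretical PCIe Gen 3 x16 throughput (assumption, not a FELIX measurement):
# 8 GT/s per lane, 16 lanes, 128b/130b line encoding.
PCIE_GEN3_X16_GBPS = 8.0 * 16 * 128 / 130

run3_gbps = aggregate_gbps(12)
run4_gbps = aggregate_gbps(24)

print(f"12 links: {run3_gbps:.1f} Gb/s")
print(f"24 links: {run4_gbps:.1f} Gb/s")
print(f"PCIe Gen 3 x16 (theoretical): {PCIE_GEN3_X16_GBPS:.1f} Gb/s")
```

Twelve full-speed links stay below the theoretical Gen 3 x16 limit, while 24 links exceed it. This suggests, although the text does not state it, that the Run 3 limit of 12 full-speed links is related to the host interface.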

[Figure: readout architecture diagram. Labels: Commodity PC, Network Switch, SW ROD / Data Handler Application.]

[Figure: performance plot. Horizontal axis: number of Data Receiving Threads. Series: 384 e-links (48 B), 192 e-links (108 B), 96 e-links (228 B), 48 e-links (468 B), 24 e-links (946 B).]



- **lpGBT** is a new protocol for the Low Power Gigabit Transceiver device that can transfer data at a 10.24 Gb/s input rate
- **Interlaken** is a point-to-point protocol that supports a 25 Gb/s input rate

#### Run 3 Performance Requirements

| Mode | Packet Size (B) | Packet Rate per Link (kHz) | Links or E-Links per Card | Packet Rate per Card (MHz) | Data Rate (Gb/s) |
|------|-----------------|----------------------------|---------------------------|----------------------------|------------------|
| GBT  | 40              | 100                        | 192                       | 19.2                       | 6                |
| FULL | 5000            | 100                        | 12                        | 2.4                        | 50               |

#### Run 4 Performance Requirements

| Mode | Packet Size (B) | Packet Rate per Link (kHz) | Links or E-Links per Card | Packet Rate per Card (MHz) | Data Rate (Gb/s) |
|------|-----------------|----------------------------|---------------------------|----------------------------|------------------|
| GBT  | 64              | 1000                       | 384                       | 384                        | 196              |
| FULL | 1024            | 1000                       | 24                        | 24                         | 192              |
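
The per-card columns follow from the per-link columns by simple multiplication. The sketch below works through three of the rows, assuming the data rate counts payload bytes only and decimal units. The function name `card_rates` is hypothetical, and the results only approximately match the rounded table entries.

```python
def card_rates(packet_size_b: int, link_rate_khz: int, n_links: int):
    """Return (packet rate per card in MHz, payload data rate in Gb/s)."""
    packets_per_s = n_links * link_rate_khz * 1_000
    packet_rate_mhz = packets_per_s / 1e6
    data_rate_gbps = packets_per_s * packet_size_b * 8 / 1e9
    return packet_rate_mhz, data_rate_gbps


# (packet size in bytes, packet rate per link in kHz, links or E-Links per card)
rows = {
    "Run 3 GBT": (40, 100, 192),
    "Run 4 GBT": (64, 1000, 384),
    "Run 4 FULL": (1024, 1000, 24),
}

results = {name: card_rates(*params) for name, params in rows.items()}
for name, (mhz, gbps) in results.items():
    print(f"{name:<11} {mhz:7.1f} MHz  {gbps:6.1f} Gb/s")
```

Both Run 4 configurations land at just under 200 Gb/s per card, even though the packet rates differ by a factor of 16.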

# 6. Run 4 Performance Test Results




Data Receiving Threads are almost independent:

- Interaction happens only when completed **slices** are inserted into the Event Assembly Map, through which complete events are built

Event building rate scales almost linearly with the number of Data Receiving Threads.
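
The structure described above can be illustrated with a toy model. This is a minimal sketch, not the SW ROD implementation: the class `EventAssemblyMap`, the functions `receive` and `build_events`, and the synthetic chunks are all hypothetical, and the sketch assumes that each thread owns a fixed subset of links and that an event is complete once every thread has contributed one slice.

```python
import threading
from collections import defaultdict


class EventAssemblyMap:
    """Shared map from event id to the slices received so far (hypothetical)."""

    def __init__(self, n_slices_per_event: int):
        self._n_slices = n_slices_per_event
        self._pending = defaultdict(list)
        self._lock = threading.Lock()
        self.built = []  # completed events as (event_id, [slices])

    def insert(self, event_id: int, slice_: dict) -> None:
        # The only point where Data Receiving Threads interact.
        with self._lock:
            slices = self._pending[event_id]
            slices.append(slice_)
            if len(slices) == self._n_slices:
                self.built.append((event_id, self._pending.pop(event_id)))


def receive(links, n_events: int, assembly: EventAssemblyMap) -> None:
    """One Data Receiving Thread: build one slice per event from its own links."""
    for event_id in range(n_events):
        # Stand-in for real input: one synthetic chunk per link per event.
        slice_ = {link: f"chunk(link={link}, event={event_id})" for link in links}
        assembly.insert(event_id, slice_)


def build_events(n_links: int = 8, n_threads: int = 4, n_events: int = 100):
    assembly = EventAssemblyMap(n_slices_per_event=n_threads)
    # Distribute links over threads round-robin.
    per_thread = [list(range(t, n_links, n_threads)) for t in range(n_threads)]
    threads = [
        threading.Thread(target=receive, args=(links, n_events, assembly))
        for links in per_thread
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return assembly.built


events = build_events()
print(len(events))
```

Each thread assembles its slice without touching shared state and takes the lock only for the short map insertion, which is the property that lets throughput scale with the number of threads. Python's global interpreter lock means this sketch shows the structure only, not the scaling itself.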

# 5. Run 4 Performance Test Setup

To verify how the Run 3 implementation of the SW ROD scales towards Run 4 requirements, a dedicated testbed has been set up. Two server models have been tested:

| Option #1: Run 3 SW ROD Computer | Option #2 |
|----------------------------------|-----------|
| <ul><li>Dual Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (16x2 cores)</li><li>L1d cache: 32K, L1i cache: 32K</li><li>L2 cache: 1024K</li><li>L3 cache: 22528K</li><li>96 GB of RAM</li></ul> | <ul><li>AMD Epyc 7313P @ 3GHz (16 cores)</li><li>L1d cache: 32K, L1i cache: 32K</li><li>L2 cache: 512K</li><li>L3 cache: 32768K</li><li>128 GB of RAM</li></ul> |

[Figure: test setup diagram. Data Publishers publish data chunks of the given size […]. Data Receiving Threads perform event building: they aggregate incoming data chunks into events.]

- 12 bytes of transport overhead per packet
- 190 Gb/s real bandwidth of the test network

| E-Links per GBT link    | N<sub>E-Links</sub>  | Packet Size (B) |
|-------------------------|----------------------|----------------|
| 1                       | 24                   | 946            |
| 1                       | 48                   | 468            |
| 2                       | 96                   | 228            |
| 4                       | 192                  | 108            |
| 8                       | 384                  | 48             |
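
The packet sizes in the table appear to be chosen so that every configuration loads the network about equally. The check below is a sketch under one assumption that is not stated next to the table: each E-Link delivers one packet per event at the 1 MHz Run 4 rate. The 12-byte overhead and the 190 Gb/s bandwidth are the figures listed above.

```python
OVERHEAD_B = 12             # transport overhead per packet (from the list above)
EVENT_RATE_HZ = 1_000_000   # assumed: one packet per E-Link per event at 1 MHz
NETWORK_GBPS = 190          # real bandwidth of the test network

# (number of E-Links, packet size in bytes), taken from the table above
configs = [(24, 946), (48, 468), (96, 228), (192, 108), (384, 48)]


def wire_rate_gbps(n_elinks: int, packet_size_b: int) -> float:
    """Network load in Gb/s, including the per-packet transport overhead."""
    return n_elinks * (packet_size_b + OVERHEAD_B) * 8 * EVENT_RATE_HZ / 1e9


rates = [wire_rate_gbps(n, size) for n, size in configs]
for (n, size), rate in zip(configs, rates):
    print(f"{n:>3} E-Links x {size:>3} B -> {rate:.1f} Gb/s")
```

Under this assumption all five configurations come out at roughly 184 Gb/s, just below the available 190 Gb/s, so the configurations differ in packet rate rather than in total data volume.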

- The number of required Data Receiving Threads increases proportionally to the number of E-Links
- The overhead produced by thread synchronization is insignificant



# 7. Conclusion

The High-Luminosity Large Hadron Collider (HL-LHC), expected to enter operation in 2029, aims to increase LHC luminosity by a factor of 10 beyond its original design.

The new Readout system for the ATLAS experiment is based on the Front-End LInk eXchange (FELIX), introduced for some detectors in Run 3. A new component, called the SW ROD, has been developed to receive data from FELIX.

The Data Handler component of the Run 4 DAQ system will be an evolution of the SW ROD. It will support the same functional requirements but must operate at an input rate of 1 MHz to cope with the HL-LHC luminosity.

Performance testing to date demonstrates that the Run 3 SW ROD application can process data at a 1 MHz rate for realistic Run 4 input configurations.

Single-core CPU performance is expected to increase by at least 50% in the next 5 years, which will provide extra computing power and decrease overall system cost.