25–29 May 2026
La Biodola - Isola d'Elba (Italy)
Europe/Rome timezone
Abstract submission deadline postponed to 18 January 2026.

Pre-Conference Program

Overview

The organizers of Real Time 2026 are pleased to offer a two-day pre-conference program focusing on several valuable open-source tools that support the development of embedded Artificial Intelligence applications. The workshop will be held on Saturday and Sunday, 23–24 May.

More details will be provided here as the program is being developed. If you have questions about this workshop, please contact Audrey Corbeil Therrien.


General day schedule (Saturday and Sunday)


9:00-12:30 ~ Morning workshop (presentation, hands-on exercises and breaks)
12:30-14:30 ~ Lunch
14:30-18:00 ~ Afternoon workshop (presentation, hands-on exercises and breaks)

 


Part 1 – Designing Artificial Intelligence for Embedded Systems

Trainer: Audrey Corbeil Therrien, Université de Sherbrooke

Artificial Intelligence has grown immensely over the last decade and enables data analysis that was impossible before. However, the vast majority of AI models are trained and run on large GPUs, where model size is unconstrained and latency is a secondary concern. For real-time edge applications, such as those found in many physics experiments and in low-resource Internet of Things devices, models must be much smaller and validated against very strict performance metrics.

In this introduction to Embedded AI, we will go over the basics of designing an AI system, define what the requirements are for a hardware-compatible AI system, introduce hardware-aware methods to reduce the size of a model, and discuss how to test and measure the performance of this type of system.
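To make the size constraint concrete, here is a back-of-the-envelope sketch of how weight precision drives memory footprint; the parameter count is illustrative, not from any specific model:

```python
# Sketch: why precision reduction matters for embedded AI.
# The parameter count below is illustrative, not from a real model.

def model_memory_bytes(n_params: int, bits_per_param: int) -> int:
    """Memory needed to store model weights at a given precision."""
    return n_params * bits_per_param // 8

n_params = 500_000  # a small CNN, for illustration

fp32 = model_memory_bytes(n_params, 32)  # typical GPU training precision
int8 = model_memory_bytes(n_params, 8)   # common embedded target

print(f"float32: {fp32 / 1024:.0f} KiB")  # 1953 KiB
print(f"int8:    {int8 / 1024:.0f} KiB")  # 488 KiB
```

A 4x reduction from precision alone, before any pruning or architecture changes, is often the difference between a model that fits on-chip and one that does not.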

 


Part 2 – hls4ml

Instructors: Benjamin Ramhorst, Marius Snella Köppel, ETH Zürich, Switzerland

FPGAs provide unique advantages in the realm of machine learning acceleration. Unlike CPUs and GPUs, FPGAs allow for custom parallelism, data type precision, and dataflow tailored specifically to the workload. Their reconfigurability enables the design of optimised hardware circuits that can reduce latency, power consumption, and improve throughput. Some common examples of FPGA-accelerated neural networks include particle classification, in-network traffic sniffing, and image segmentation for autonomous vehicles.

In this tutorial, we will introduce hls4ml, an open-source library for real-time deployment of neural networks on FPGAs, and hold a hands-on demo. hls4ml enables seamless conversion of high-level models (e.g., from Keras or PyTorch) into low-latency, low-power FPGA designs. We will cover the design choices behind hls4ml, from deeply pipelined dataflow architectures to model quantization and pruning.
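As a taste of what precision tuning means in practice, here is a minimal plain-Python sketch of signed fixed-point quantization in the style of the HLS `ap_fixed<W,I>` types that hls4ml designs use; this illustrates the concept only and is not hls4ml's API:

```python
# Sketch of fixed-point quantization as used on FPGAs (conceptually
# similar to HLS ap_fixed<W,I>); plain Python, not hls4ml's API.

def quantize_fixed(x: float, total_bits: int, int_bits: int) -> float:
    """Round x to a signed fixed-point grid with `total_bits` bits,
    `int_bits` of which are integer bits (including the sign bit)."""
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits                   # 2**frac_bits steps per unit
    lo = -(1 << (int_bits - 1))              # most negative value
    hi = (1 << (int_bits - 1)) - 1 / scale   # most positive value
    q = round(x * scale) / scale             # snap to the grid
    return min(max(q, lo), hi)               # saturate on overflow

# ap_fixed<16,6>-style: 16 bits total, 6 integer bits -> 1/1024 resolution
print(quantize_fixed(0.123456, 16, 6))  # 0.123046875 (126/1024)
print(quantize_fixed(100.0, 16, 6))     # 31.9990234375 (saturated)
```

Choosing the width per layer is exactly the latency/resource/accuracy trade-off the hands-on exercises explore.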

In the second, more advanced part of the tutorial, the focus will shift to quantization-aware training (QAT). This section will cover the motivation and theory behind QAT, demonstrate hands-on usage of the updated QKerasV3, and show how quantized models can be seamlessly integrated into the hls4ml workflow. During the hands-on exercises, participants will work with hls4ml’s Python API and explore the following topics:

  • Quantization-aware training using QKerasV3
  • Model conversion and synthesis with hls4ml
  • Analysis of model latency and FPGA resource utilisation
  • Tuning design parameters to optimise performance and resource usage
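To illustrate the core mechanism behind quantization-aware training, here is a minimal plain-Python sketch of fake quantization with a straight-through estimator on a single weight; the functions and numbers are invented for illustration and are not QKerasV3's API:

```python
# Core idea of quantization-aware training: the forward pass uses
# quantized weights, but the gradient flows through as if quantization
# were the identity (the "straight-through estimator", STE).
# Plain Python for illustration; QKerasV3 wraps this inside Keras layers.

def fake_quantize(w: float, bits: int = 4) -> float:
    """Uniformly quantize w in [-1, 1] to a 2**bits-level grid."""
    scale = 1 << (bits - 1)  # grid steps per unit
    return max(-1.0, min(1.0, round(w * scale) / scale))

def qat_step(w: float, x: float, y_target: float, lr: float = 0.1) -> float:
    """One SGD step on loss = (w_q * x - y_target)**2 using the STE."""
    w_q = fake_quantize(w)   # forward pass sees the quantized weight
    err = w_q * x - y_target
    grad = 2 * err * x       # STE: d(w_q)/dw treated as 1
    return w - lr * grad     # update the full-precision shadow weight

w = 0.30
for _ in range(20):
    w = qat_step(w, x=1.0, y_target=0.8)
# w_q settles on a 4-bit grid point near the target 0.8
print(round(w, 3), fake_quantize(w))
```

Because the model trains against its own quantization error, accuracy after conversion is far closer to the floating-point baseline than with post-training quantization alone.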

The tutorial will conclude with a live demonstration of neural network inference running on a real FPGA, showcasing the end-to-end workflow from training to deployment.

More information on hls4ml and QKerasV3:

 


Part 3 – Super Neural Architecture Codesign Package (SNAC-Pack)

Instructors: Dmitri Demler, Jason Weitz, University of California San Diego

Machine learning has become a critical tool for analysis and decision-making across a wide range of scientific domains, from particle physics to materials science. However, the deployment of neural networks in resource-constrained environments, such as hardware accelerators and edge devices, remains a significant challenge. This often requires specialized expertise in both neural architecture design and hardware optimization.

To address this challenge, we introduce the Super Neural Architecture Codesign Package (SNAC-Pack), an integrated framework that automates the discovery and optimization of neural network architectures specifically tailored for hardware deployment. SNAC-Pack combines two powerful tools: Neural Architecture Codesign, which performs a two-stage neural architecture search for optimal models, and the Resource Utilization and Latency Estimator, which predicts how an architecture will perform when implemented on FPGA hardware.

SNAC-Pack streamlines the neural architecture design process by enabling researchers to automatically explore diverse architectures optimized for both task performance and hardware efficiency. By providing quick estimates of resource utilization and latency without requiring time-consuming synthesis, SNAC-Pack accelerates the development cycle. State-of-the-art compression techniques, such as quantization-aware training and pruning, further optimize the models, resulting in architectures that can be deployed to FPGA hardware.
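The codesign idea can be illustrated with a toy search loop: score candidate architectures with a cheap resource estimate instead of full synthesis, and keep the best model under a budget. Everything below (the search space, the estimator, the accuracy proxy, the budget) is invented for illustration and is not SNAC-Pack's actual interface:

```python
# Toy sketch of hardware-aware architecture search in the spirit of
# SNAC-Pack: rank candidates by a task metric under a cheap resource
# estimate, avoiding per-candidate synthesis. All names and formulas
# here are invented for illustration, not SNAC-Pack's API.
import itertools

def estimate_dsp_usage(widths):
    """Crude stand-in for a resource estimator: one multiplier per
    weight, i.e. DSP cost ~ sum of consecutive layer-size products."""
    layers = [16] + list(widths) + [4]  # fixed input/output sizes
    return sum(a * b for a, b in zip(layers, layers[1:]))

def proxy_accuracy(widths):
    """Invented stand-in for trained accuracy: wider helps, saturating."""
    return 1.0 - 1.0 / (1 + sum(widths) / 32)

def search(budget_dsp=2000):
    """Exhaustively score a tiny space; keep the best model in budget."""
    space = itertools.product([8, 16, 32, 64], repeat=2)  # 2 hidden layers
    feasible = [w for w in space if estimate_dsp_usage(w) <= budget_dsp]
    return max(feasible, key=proxy_accuracy)

best = search()
print(best, estimate_dsp_usage(best))  # (16, 64) 1536
```

Real searches of course use trained accuracy and learned estimators over far larger spaces, but the feasibility-then-rank structure is the same.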

This tutorial provides a hands-on introduction to SNAC-Pack, guiding participants through the complete workflow from dataset preparation to hardware deployment. By the end of the tutorial, attendees will be able to run SNAC-Pack for their own applications, achieving improvements in accuracy, latency, and resource utilization compared to naive hand-crafted approaches.

 


Part 4 - Coyote v2: Open-source Abstractions and Infrastructure for FPGAs

Trainer: Benjamin Ramhorst, ETH Zürich, Switzerland

As Moore’s Law and Dennard Scaling reach their limits, computing is shifting toward heterogeneous hardware for large-scale data processing. Cloud vendors are deploying accelerators such as GPUs, DPUs, and FPGAs to meet the growing computational demands of ML and big data.

 

While FPGAs offer great flexibility and performance, integrating them into larger systems in practice remains challenging due to the long development cycles and expertise required. To address this, we introduce Coyote v2, an open-source FPGA shell with high-level, OS-like abstractions. Broadly speaking, Coyote v2 strives to simplify application deployment, enabling developers to focus solely on their application logic and its performance rather than on infrastructure development. By providing clear, simple-to-use interfaces in both hardware and software, Coyote v2 lets developers leverage these abstractions to build customized acceleration offloads and distributed, heterogeneous computer systems consisting of many FPGAs, GPUs, and CPUs. Coyote v2 has been re-engineered to serve as a flexible base platform for multi-tenant accelerators, SmartNICs, and near-memory accelerators.

 

This tutorial will cover three core features of Coyote v2: vFPGAs, which let users seamlessly deploy arbitrary applications on FPGAs; the built-in networking stacks for distributed applications; and the shared virtual memory model, which enables the FPGA to interact with other hardware (CPU, GPU, storage). Additionally, we will showcase Coyote's high-level software API, which enables easy yet high-performance interaction with the FPGA from C++. Finally, we will showcase Coyote's integration with hls4ml, performing inference on a PCIe-attached FPGA from a few lines of Python.

 

More information on Coyote:

https://github.com/fpgasystems/Coyote

https://dl.acm.org/doi/abs/10.1145/3731569.3764845