CMS-CSCS kick-start

Timezone: Europe/Zurich
Pablo Fernandez Fernandez (ETH Zurich (CH))
    • 14:00 → 14:10
      Spec review 10m

      Node specs @ Piz Daint

      • up to 150 nodes (shared with ATLAS and CSCS users, for the Tier-0)
      • dual socket Xeon E5-2695 v4 @ 2.10GHz
      • 68 schedulable cores (HT enabled), for a total of ~10'000 cores
      • 128 GB RAM/node (no per-job memory limits; a small swap is available but not recommended), i.e. ~2 GB RAM/core (see the quick check after this list)
      • 700 TB of Scratch, shared with the Tier-2 (recently reinforced with an SSD layer)
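
      A quick back-of-envelope check of the quoted totals, as a minimal Python sketch using only the numbers above:

        nodes = 150                 # up to 150 shared nodes
        cores_per_node = 68         # schedulable cores with HT enabled
        ram_per_node_gb = 128

        total_cores = nodes * cores_per_node                # 10200, i.e. ~10'000
        ram_per_core_gb = ram_per_node_gb / cores_per_node  # ~1.88, i.e. ~2 GB/core
        print(total_cores, round(ram_per_core_gb, 2))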

      Middleware @ Piz Daint (dedicated to Tier-0)

      • 4 ARC servers (2 for submission, all 4 for data staging); see the reachability sketch after this list
      • Queues not published on BDII
      • Accounting not pushed to APEL
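
      A minimal reachability smoke test for the four ARC front-ends, as a Python sketch. The hostnames are hypothetical placeholders (the real endpoints are not named in these notes); port 2811 is the standard ARC GridFTP submission port:

        import socket

        # Hypothetical placeholder names; substitute the real CE hostnames.
        ARC_SERVERS = [f"arc0{i}.example.cscs.ch" for i in range(1, 5)]

        def is_reachable(host, port=2811, timeout=5.0):
            """True if a TCP connection to the ARC GridFTP port succeeds."""
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return True
            except OSError:
                return False

        for host in ARC_SERVERS:
            print(host, "up" if is_reachable(host) else "DOWN")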

      Proposed CMS reconstruction workload

      • 8-thread processes (16 GB RAM each; see the packing check after this list)
      • For 1000 cores, the required input cache is ~20 TB (~1 week buffer), read at ~500 MB/s
      • Mostly pure streaming (push data to Scratch, process, send 50% of the data back to other sites)
      • 8-12 hour jobs
      • CentOS & Singularity needed
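
      As a sanity check, a minimal sketch using only the numbers above: eight 8-thread jobs per node exactly exhaust the 128 GB of RAM, and the 500 MB/s aggregate read rate spread over the jobs filling 1000 cores is roughly 4 MB/s per job:

        threads_per_job = 8
        ram_per_job_gb = 16
        cores_per_node = 68
        ram_per_node_gb = 128

        jobs_per_node = cores_per_node // threads_per_job  # 8 jobs, 4 cores spare
        ram_used_gb = jobs_per_node * ram_per_job_gb       # 128 GB -> fills the node

        jobs_per_1000_cores = 1000 // threads_per_job      # 125 concurrent jobs
        per_job_read_mb_s = 500 / jobs_per_1000_cores      # 4.0 MB/s per job
        print(jobs_per_node, ram_used_gb, jobs_per_1000_cores, per_job_read_mb_s)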

      Integration @ CSCS

      • ATLAS runs occasional (2-3 days per week) high-priority workloads
      • CMS runs as background tasks that can use the nodes 24x7 where possible
      • Up to 150 nodes are available (may be fewer, depending on the load)
      • CMS can potentially use all the nodes, and is scheduled out when ATLAS workloads arrive
      • All of this is managed by the scheduler (see the toy model after this list)
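
      The sharing policy can be illustrated with a toy model in Python (a sketch only, not the actual scheduler configuration): ATLAS is served first, and CMS backfills whatever is left.

        TOTAL_NODES = 150

        def allocate(atlas_demand, cms_demand):
            """ATLAS is served first; CMS backfills the remaining nodes."""
            atlas = min(atlas_demand, TOTAL_NODES)
            cms = min(cms_demand, TOTAL_NODES - atlas)
            return {"ATLAS": atlas, "CMS": cms, "idle": TOTAL_NODES - atlas - cms}

        print(allocate(atlas_demand=0, cms_demand=200))    # CMS fills all 150 nodes
        print(allocate(atlas_demand=100, cms_demand=200))  # CMS squeezed down to 50
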
    • 14:10 → 14:45
      Q&A 35m

      Q&A session

    • 14:45 → 15:00
      Next steps 15m
      • Implement needed changes (TBD)
        • [CSCS] Enable queue and endpoint for CMS (tell Stephan and Giuseppe)
        • [CMS] Configure a new site on CMS factory
      • Controlled Test
        1. [CMS] Functional tests (e.g. 2 nodes)
        2. [CMS] Small load test (e.g. 15 nodes, 1000 cores)
        3. [CMS] Scale-up test (up to 150 nodes; check that Squid capacity is sufficient, see the sizing sketch below)
      • Production
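
      For the Squid check in step 3, a back-of-envelope sizing sketch; the per-job payload and ramp-up window below are hypothetical placeholders, not figures from this meeting, and should be replaced with values measured in the functional and small load tests:

        nodes = 150
        jobs_per_node = 8                        # 8-thread jobs on 68-core nodes
        concurrent_jobs = nodes * jobs_per_node  # 1200 jobs at full scale

        payload_per_start_mb = 100  # PLACEHOLDER: Frontier/CVMFS data per job start
        rampup_window_s = 600       # PLACEHOLDER: jobs starting over ~10 minutes

        burst_mb_s = concurrent_jobs * payload_per_start_mb / rampup_window_s
        print(f"startup burst through Squid: ~{burst_mb_s:.0f} MB/s")  # ~200 MB/s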