CMS-CSCS kick-start

Timezone: Europe/Zurich
Pablo Fernandez Fernandez (ETH Zurich (CH))
    • 14:00 → 14:10
      Spec review 10m

      Node specs @ Piz Daint

      • up to 150 nodes (shared with ATLAS and CSCS users, for the Tier-0)
      • dual socket Xeon E5-2695 v4 @ 2.10GHz
      • 68 schedulable cores (HT enabled), for a total of ~10'000 cores
      • 128 GB RAM/node (no per-job memory limits; a small swap is available but not recommended), i.e. ~2 GB RAM/core (see the quick check after this list)
      • 700 TB of Scratch, shared with the Tier-2 (recently reinforced with an SSD layer)
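
      A quick back-of-envelope check of the quoted totals, as a minimal Python sketch using only the numbers above:

        nodes = 150                 # up to 150 shared nodes
        cores_per_node = 68         # schedulable cores with HT enabled
        ram_per_node_gb = 128

        total_cores = nodes * cores_per_node                # 10200, i.e. ~10'000
        ram_per_core_gb = ram_per_node_gb / cores_per_node  # ~1.88, i.e. ~2 GB/core
        print(total_cores, round(ram_per_core_gb, 2))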

      Middleware @ Piz Daint (dedicated to Tier-0)

      • 4 ARC servers (2 for submission, all 4 for data staging); see the reachability sketch after this list
      • Queues not published on BDII
      • Accounting not pushed to APEL
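
      A minimal reachability smoke test for the four ARC front-ends, as a Python sketch. The hostnames are hypothetical placeholders (the real endpoints are not named in these notes); port 2811 is the standard ARC GridFTP submission port:

        import socket

        # Hypothetical placeholder names; substitute the real CE hostnames.
        ARC_SERVERS = [f"arc0{i}.example.cscs.ch" for i in range(1, 5)]

        def is_reachable(host, port=2811, timeout=5.0):
            """True if a TCP connection to the ARC GridFTP port succeeds."""
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return True
            except OSError:
                return False

        for host in ARC_SERVERS:
            print(host, "up" if is_reachable(host) else "DOWN")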

      Proposed CMS reconstruction workload

      • 8-thread processes (16 GB RAM each; see the packing check after this list)
      • For 1000 cores, the required input cache is ~20 TB (~1 week buffer), read at ~500 MB/s
      • Mostly pure streaming (push data to Scratch, process, send 50% of the data back to other sites)
      • 8-12 hour jobs
      • CentOS & Singularity needed
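
      As a sanity check, a minimal sketch using only the numbers above: eight 8-thread jobs per node exactly exhaust the 128 GB of RAM, and the 500 MB/s aggregate read rate spread over the jobs filling 1000 cores is roughly 4 MB/s per job:

        threads_per_job = 8
        ram_per_job_gb = 16
        cores_per_node = 68
        ram_per_node_gb = 128

        jobs_per_node = cores_per_node // threads_per_job  # 8 jobs, 4 cores spare
        ram_used_gb = jobs_per_node * ram_per_job_gb       # 128 GB -> fills the node

        jobs_per_1000_cores = 1000 // threads_per_job      # 125 concurrent jobs
        per_job_read_mb_s = 500 / jobs_per_1000_cores      # 4.0 MB/s per job
        print(jobs_per_node, ram_used_gb, jobs_per_1000_cores, per_job_read_mb_s)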

      Integration @ CSCS

      • ATLAS runs occasional (2-3 days per week) high-priority workloads
      • CMS runs as background tasks that can use the nodes 24x7 where possible
      • Up to 150 nodes are available (may be fewer, depending on the load)
      • CMS can potentially use all the nodes, and is scheduled out when ATLAS workloads arrive
      • All of this is managed by the scheduler (see the toy model after this list)
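
      The sharing policy can be illustrated with a toy model in Python (a sketch only, not the actual scheduler configuration): ATLAS is served first, and CMS backfills whatever is left.

        TOTAL_NODES = 150

        def allocate(atlas_demand, cms_demand):
            """ATLAS is served first; CMS backfills the remaining nodes."""
            atlas = min(atlas_demand, TOTAL_NODES)
            cms = min(cms_demand, TOTAL_NODES - atlas)
            return {"ATLAS": atlas, "CMS": cms, "idle": TOTAL_NODES - atlas - cms}

        print(allocate(atlas_demand=0, cms_demand=200))    # CMS fills all 150 nodes
        print(allocate(atlas_demand=100, cms_demand=200))  # CMS squeezed down to 50
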
    • 14:10 → 14:45
      Q&A 35m

      Q&A session

    • 14:45 → 15:00
      Next steps 15m
      • Implement needed changes (TBD)
        • [CSCS] Enable queue and endpoint for CMS (tell Stephan and Giuseppe)
        • [CMS] Configure a new site on CMS factory
      • Controlled Test
        1. [CMS] Functional tests (e.g. 2 nodes)
        2. [CMS] Small load test (e.g. 15 nodes, 1000 cores)
        3. [CMS] Scale-up test (up to 150 nodes; check that Squid capacity is sufficient, see the sizing sketch below)
      • Production
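
      For the Squid check in step 3, a back-of-envelope sizing sketch; the per-job payload and ramp-up window below are hypothetical placeholders, not figures from this meeting, and should be replaced with values measured in the functional and small load tests:

        nodes = 150
        jobs_per_node = 8                        # 8-thread jobs on 68-core nodes
        concurrent_jobs = nodes * jobs_per_node  # 1200 jobs at full scale

        payload_per_start_mb = 100  # PLACEHOLDER: Frontier/CVMFS data per job start
        rampup_window_s = 600       # PLACEHOLDER: jobs starting over ~10 minutes

        burst_mb_s = concurrent_jobs * payload_per_start_mb / rampup_window_s
        print(f"startup burst through Squid: ~{burst_mb_s:.0f} MB/s")  # ~200 MB/s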