Workshops and Working Groups
Murrough Landon - 6 March 2003
http://www.hep.ph.qmul.ac.uk/landon/talks
Overview
- Recent DIG meeting(?)
- Recent database workshop
- ROD crate DAQ workshop (26 March)
- Monitoring
- Error handling and fault tolerance
DIG Meeting
ROD contracts?
Database Workshop (1)
Survey of databases in ATLAS: TC
- Technical coordination: installation databases
- Will have full details of every component and subcomponent,
eg cables, modules, etc installed in the cavern and USA15
- legal requirement for decommissioning
- Defines pieces of equipment and functional locations
- Will track locations (movement of spares etc)
- ``equipment passports''
- Will (aim to) take copies of production databases or at least
provide some common access to them?
- Excel templates for bulk upload of data
- Not sure if small projects could use it as their own
production database?
- All to be presented via EDMS interface
Database Workshop (2)
Survey of databases in ATLAS: others
- DCS database
- Example detector database (muons)
- Example TDAQ database (l1calo)
- Online configuration
- HLT view
- Offline views (simulation, reconstruction, analysis)
Database Workshop (3)
Conditions database
- Presentations by detectors (LAr, Muon)
- LHCXX common project for conditions database
- ATLAS offline view (IoV layer over POOL/ROOT)
- Lisbon prototype
- GRID access
- Coherence of conditions and configuration data?
- Lots of discussion: not so sure about the convergence...
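To make the IoV (interval of validity) idea mentioned above concrete, here is a minimal sketch of conditions storage and lookup by validity interval. The ConditionsFolder class and all names are invented for illustration; this is not the real POOL/Lisbon API.

    import bisect

    class ConditionsFolder:
        """Toy interval-of-validity (IoV) store: each payload is valid
        from its 'since' time until the next stored 'since' time."""
        def __init__(self):
            self._since = []    # sorted validity start times
            self._payload = []  # payload stored at each start time

        def store(self, since, payload):
            i = bisect.bisect_left(self._since, since)
            self._since.insert(i, since)
            self._payload.insert(i, payload)

        def retrieve(self, time):
            i = bisect.bisect_right(self._since, time) - 1
            if i < 0:
                raise KeyError("no conditions valid at time %s" % time)
            return self._payload[i]

    # Example: per-run calibration constants keyed by run number
    folder = ConditionsFolder()
    folder.store(since=1000, payload={"ped_mean": 40.1})
    folder.store(since=1200, payload={"ped_mean": 40.4})
    print(folder.retrieve(1150))   # constants valid for runs 1000-1199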
Database Workshop (4)
L1Calo viewpoint
- Reprise of common level 1 requirements (web page prepared
by Thorsten Wengler last September and circulated to the at1soft list)
http://cern.ch/Atlas/GROUPS/DAQTRIG/LEVEL1/software/level1_databases.html
- Outline of L1Calo configuration data (with rough data sizes)
- calibrations: energy, pulse shape, timing
- crates, modules, cables, mapping to calorimeter channels
- hot and dead cell map
- firmware binaries (ideally not just pointers to files)
- multiple parallel versions required
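Purely as an illustration, the configuration data listed above (calibrations, mappings, hot/dead channel map, firmware) could be grouped into parallel named versions roughly as below. The class and field names are hypothetical, not the actual L1Calo schema.

    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class TowerCalibration:
        """Per-trigger-tower calibration constants (illustrative fields only)."""
        energy_scale: float
        timing_offset_ns: float
        disabled: bool = False            # hot/dead channel map entry

    @dataclass
    class ConfigurationVersion:
        """One complete, self-consistent L1Calo configuration.
        Several such versions (physics, calibration, test...) exist in parallel."""
        name: str
        towers: Dict[int, TowerCalibration] = field(default_factory=dict)
        cabling: Dict[str, str] = field(default_factory=dict)    # module/cable mapping
        firmware: Dict[str, bytes] = field(default_factory=dict) # binaries, not file paths

    configs = {
        "physics_v1": ConfigurationVersion(name="physics_v1"),
        "calib_timing": ConfigurationVersion(name="calib_timing"),
    }
    configs["physics_v1"].towers[0x0101] = TowerCalibration(1.02, 3.5)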
Database Workshop (5)
L1Calo viewpoint (continued)
- Conditions data (less well defined, especially numerically)
- store the complete conditions of each run
- cf common view that conditions DB should only hold
what is of interest to the offline
- data from monitoring (histograms, statistics etc)
- Feedback on present configuration database
Forthcoming ROD Crate DAQ Workshop
Ralf wants me to talk about our experience with the Online configuration database...
Monitoring (1)
Working group
- Members: Francois Touchard, Hanspeter Beck, Sergei Kolos,
Beniamino di Girolamo and me
- Aim to produce TDR chapter and backup document - continue
to worry about monitoring after the TDR too...?
- Draft backup document prepared (still at a fairly rough stage)
- TDR chapter will be distilled from this
- TDAQ network group are very keen to get numerical input
about the ``monitoring matrix'', ie traffic between various
sources and destinations of monitoring data
- Problem: detectors (and trigger groups) are busy and not
particularly concerned by the DAQ/HLT TDR timescale
- Monitoring workshop held during ATLAS week: some input
(mostly verbal, not talks) from detectors but no numbers
Monitoring (2)
L1Calo perspective
- Existing document from Eric
http://www.hep.ph.qmul.ac.uk/l1calo/doc/pdf/MonReqs.pdf
- Very detailed on some requirements, but says little about where monitoring is done or how much data is involved
- We should try to expand on this area
- And guess at some numbers...
Where to monitor
- The SFI/EF level and ROD crate DAQ will be the most useful places
to monitor during normal ATLAS physics running
- Monitoring via the ROS is needed for testing and
installation
- Some calibrations expected to be done at ROD crate level
(events monitored via ROD crate DAQ or maybe on board DSPs)
- Other calibrations will require dedicated EF tasks (or offline)
Monitoring: Trigger Decision
Check the level 1 trigger is functioning correctly
- For the calorimeter trigger, monitor the whole chain from
calorimeter analogue electronics,
through receiver stations, preprocessor, cluster and
jet/energy processors to the CTP
- Use full calorimeter readout, readout from the preprocessor
and later stages of the trigger pipeline
- Simulate the trigger functionality on a subset of events
and look for errors
- Full simulation needs the full event: ie via SFI or on
dedicated EF nodes
- Check both accepted events and sample of rejected
events (selection of suitable monitoring triggers
needs more thought)
- Partial simulation (of the digital trigger electronics
from the preprocessor onwards) could be done on event
fragments collected via ROD crate DAQ (if that supports
coherent samples across 2-3 ROD crates) [Livio: ``use ROS'']
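A sketch of the comparison step described above: re-run the trigger logic in software on a sample of events and flag disagreements with the hardware result. The event layout and the simulate_trigger and log_error helpers are stand-ins, not the real L1Calo simulation or message reporting interfaces.

    def check_trigger_decision(events, simulate_trigger, log_error):
        """Re-run the trigger logic in software on a sample of events and
        flag any disagreement with the hardware result in the readout."""
        mismatches = 0
        for event in events:
            hw_result = event["l1_result"]                       # from trigger pipeline readout
            sim_result = simulate_trigger(event["tower_data"])   # software re-calculation
            if sim_result != hw_result:
                mismatches += 1
                log_error("L1 mismatch in event %s: hw=%s sim=%s"
                          % (event["id"], hw_result, sim_result))
        return mismatches

    # Toy usage with stand-in data and a trivial 'simulation'
    events = [{"id": 1, "tower_data": [5, 40], "l1_result": True},
              {"id": 2, "tower_data": [3, 2],  "l1_result": True}]   # second should disagree
    count = check_trigger_decision(events,
                                   simulate_trigger=lambda towers: max(towers) > 10,
                                   log_error=print)
    print("mismatches:", count)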
Monitoring: Trigger Hardware
Monitor operational aspects of the trigger
- Check levels of crosstalk, and measure bit error rates and reliability
of links and backplanes, etc, using both hardware monitoring
via VME at O(1Hz) and event data
(some overlap with trigger decision monitoring, but not all
errors may affect the resulting L1A)
- Measure ``level 0'' rates and energy spectrum in each trigger
tower: preprocessor hardware histogram per tower read by
the crate CPUs (at 10Hz, but published less frequently)
- Rapid detection of new hot or dead cells, also indication
of real beam conditions
- Check current calibrations are still optimal
- Readout related aspects: buffer utilisation, etc
- Physical quantities (crate temperatures, voltages, etc)
via DCS
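A sketch of the kind of hot/dead tower check the per-tower rate histograms would allow; the threshold values and tower ids are placeholders, not agreed numbers.

    from statistics import median

    def flag_towers(rates, hot_factor=20.0, dead_rate=0.0):
        """Flag towers whose rate is far above the median (hot) or zero (dead).
        'rates' maps tower id -> counts seen in the last publishing interval."""
        typical = median(rates.values())
        hot  = [t for t, r in rates.items() if typical > 0 and r > hot_factor * typical]
        dead = [t for t, r in rates.items() if r <= dead_rate]
        return hot, dead

    rates = {0x0101: 12, 0x0102: 15, 0x0103: 900, 0x0104: 0}
    hot, dead = flag_towers(rates)
    print("hot:", hot, "dead:", dead)   # tower 0x0103 flagged hot, 0x0104 flagged dead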
Monitoring: Trigger Rates and Performance
Detailed monitoring of rates and deadtime
- Trigger rates through the system: primitives (internal calo/muon
details), CTP inputs, prescales, deadtime, final trigger ``items'', etc
- Rates per trigger tower (efficiency across eta-phi space)
- Correlations between trigger rates
- Correlations of trigger rates with beam conditions, etc
- History plots
- Etc...
Trigger performance
- Detailed studies will be mainly done offline, but some online
checks may also be useful?
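One example of a simple online check in this spirit: the overlap between two trigger items computed from their per-event accept decisions. The item names and event layout are invented for illustration.

    def item_correlation(events, item_a, item_b):
        """Fraction of events accepted by item_b among those accepted by item_a,
        a crude measure of overlap between two trigger items."""
        fired_a = [e for e in events if item_a in e["items"]]
        if not fired_a:
            return 0.0
        both = sum(1 for e in fired_a if item_b in e["items"])
        return both / len(fired_a)

    events = [{"items": {"EM25", "J60"}}, {"items": {"EM25"}}, {"items": {"J60"}}]
    print(item_correlation(events, "EM25", "J60"))   # 0.5 of EM25 events also fired J60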
Monitoring Matrix
Impact of monitoring on the network
- We will surely use all available resources for monitoring
- not sure what we absolutely need (and network bandwidth
may not be the limiting factor)
- Some expected traffic flows (no numbers yet):
- RODs (might have network interfaces?) to local workstations??
- crate CPUs to local workstations (hardware monitoring, ROD events)
- local workstations to central displays (histograms from local monitoring)
- events to EF tasks
- histograms/statistics from EF tasks to EF histogram collection
- various sources saving data to conditions database
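Even without numbers, the expected flows above can be written down as a source/destination matrix; a sketch with the bandwidth column deliberately left unfilled.

    # Monitoring matrix skeleton: (source, destination, data type, bandwidth).
    # Bandwidth left as None until real estimates exist.
    monitoring_flows = [
        ("ROD",               "local workstation",       "sampled events?",                  None),
        ("crate CPU",         "local workstation",       "hardware monitoring, ROD events",  None),
        ("local workstation", "central display",         "histograms",                       None),
        ("SFI",               "EF monitoring task",      "full events",                      None),
        ("EF task",           "EF histogram collection", "histograms/statistics",            None),
        ("various",           "conditions database",     "monitoring results",               None),
    ]

    for src, dst, what, bw in monitoring_flows:
        print("%-18s -> %-24s %s (bandwidth: %s)" % (src, dst, what, bw or "TBD"))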
Error Handling and Fault Tolerance (1)
Working group
- Members: Doris Burckhart, Andre Bogaerts, Reiner Hauser,
Beniamino di Girolamo and me
- Aim to produce TDR chapter and backup document
- Draft TDR chapter prepared (available via DAQ/HLT TDR page)
- Backup document will be an expansion of this
Error Handling and Fault Tolerance (2)
Topics covered
- Error message handling: transport and filtering (MRS)
- Classification of errors (FATAL, ERROR, WARNING, ``INFO'')
- also diagnostic and debug messages
- Handling of errors: distributed expert systems proposed
- Hierarchy probably matching that of the run control
- Handle errors as close to their origin as possible
(and if fixed, report success with an ``INFO'' message)
- If not handled, pass to higher level, which may need
to pass requests for actions back down again
- Fault tolerance: identify single points of failure,
make recommendations on achieving fault tolerance
within each subsystem (eg error in one component
should not affect neighbouring components)
- Try to collect error handling scenarios from subsystems
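A minimal sketch of the ``handle locally, escalate if unhandled'' idea in a run-control-like hierarchy; the class and message strings are invented for illustration and do not represent the MRS or expert system interfaces.

    class Controller:
        """One node in a run-control-like hierarchy.  Tries its local handlers
        first and passes the error to its parent only if none of them succeed."""
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent
            self.handlers = []     # callables returning True if the error was fixed

        def report(self, error):
            for handler in self.handlers:
                if handler(error):
                    print("INFO [%s]: recovered from '%s'" % (self.name, error))
                    return True
            if self.parent is not None:
                print("WARNING [%s]: escalating '%s'" % (self.name, error))
                return self.parent.report(error)
            print("ERROR [%s]: unhandled '%s'" % (self.name, error))
            return False

    root = Controller("L1Calo")
    pp_crate = Controller("PP_crate_1", parent=root)
    pp_crate.handlers.append(lambda err: err == "hot tower")  # local fix: mask the tower
    pp_crate.report("hot tower")    # handled locally, INFO issued
    pp_crate.report("link down")    # escalated to parent, then reported as ERROR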
Error Handling and Fault Tolerance (3)
L1Calo scenario: dealing with a hot cell
- Possible case for passing errors up and down different error
handling and run control hierarchies
- Hot cell detected (very quickly?) by PP rate monitoring: error message issued
- PP could suppress the tower automatically (Conditions DB update?)
- Perhaps also seen quickly by calo monitoring if it's a faulty
component?
- SFI/EF monitoring will (later) also see the hot cell in whole event data
- or maybe by watching for Conditions DB updates or PP error messages??
- Only SFI/EF monitoring can determine whether it's an individual calo cell
or the whole trigger tower
- If the former, the SFI/EF monitoring reports this and requests the PP system
to re-enable the tower (another Conditions DB update?)
- If calo monitoring has already suppressed the cell,
the later SFI/EF request can be ignored
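The decision flow of this scenario, sketched with stand-in objects for the PP, calorimeter monitoring and conditions database (all names are hypothetical).

    class Stub:
        """Stand-in for the PP, calorimeter and conditions DB interfaces."""
        def __init__(self): self.masked = set()
        def suppress(self, t): self.masked.add(t)
        def reenable(self, t): self.masked.discard(t)
        def already_suppressed(self, t): return t in self.masked
        def record(self, action, t): print("conditions DB:", action, hex(t))

    def handle_hot_tower(tower, pp, calo, cond_db, diagnosis):
        """Hot-tower scenario: PP suppresses immediately; the later SFI/EF
        diagnosis decides whether the tower can be re-enabled."""
        pp.suppress(tower)                   # automatic local action by the PP
        cond_db.record("suppressed", tower)
        if diagnosis == "single calo cell":  # from SFI/EF whole-event monitoring
            if calo.already_suppressed(tower):
                return "SFI/EF request ignored: calorimeter already masked the cell"
            pp.reenable(tower)               # trigger tower itself is healthy
            cond_db.record("re-enabled", tower)
            return "tower re-enabled"
        return "tower kept suppressed"

    pp, calo, cond_db = Stub(), Stub(), Stub()
    print(handle_hot_tower(0x0103, pp, calo, cond_db, diagnosis="single calo cell"))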
Murrough Landon (m.p.j.landon@qmul.ac.uk)