Murrough Landon - 17 December 2002 http://www.hep.ph.qmul.ac.uk/landon/talks
Overview
Monitoring
Error handling and fault tolerance
Databases
Monitoring (1)
Group Members
Francois Touchard (EF), Sergei Kolos (Online),
Hans Peter Beck (DataCollection), Beniamino di Girolamo (DIG),
me (Level1)
Aims
Study monitoring requirements
Monitoring matrix: data volume between monitoring
sources and destinations to plan the network
Operational monitoring of the DAQ system
Operational and physics monitoring of the trigger
Produce monitoring chapter in the TDR
Monitoring (2)
Plans
Preliminary document already exists (Level1 submission is late)
Questions to TDAQ subsystems and detectors
Monitoring workshop (around ATLAS week or database workshop?)
L1Calo specific
Monitoring sources: PPMs, ROD crate controller, EF monitoring tasks
(not ROS in final system?)
Monitoring destinations: ROD crate workstations, EF tasks combining
histograms from many EF nodes, operator workstations displaying results,
conditions database for some histograms and other monitoring data
Volume of traffic is not clear (to me)
L1Calo monitoring document needs another look in this context
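As a starting point for the monitoring matrix, the sketch below tabulates traffic between the L1Calo sources and destinations listed above and sums the load arriving at each destination. Every rate in it is a hypothetical placeholder (the real volumes are not yet known), and the names `traffic_kb_per_s` and `load_per_destination` are made up for illustration.

```python
from collections import defaultdict

# (source, destination) -> estimated traffic in kB/s.
# ALL VALUES ARE HYPOTHETICAL PLACEHOLDERS, not measurements.
traffic_kb_per_s = {
    ("PPM", "ROD crate workstation"): 10.0,
    ("ROD crate controller", "ROD crate workstation"): 5.0,
    ("EF monitoring task", "EF histogram combiner"): 20.0,
    ("EF histogram combiner", "operator workstation"): 2.0,
    ("EF histogram combiner", "conditions database"): 1.0,
}


def load_per_destination(matrix):
    """Sum the incoming traffic for each monitoring destination."""
    totals = defaultdict(float)
    for (_source, destination), rate in matrix.items():
        totals[destination] += rate
    return dict(totals)


if __name__ == "__main__":
    for destination, rate in sorted(load_per_destination(traffic_kb_per_s).items()):
        print(f"{destination}: {rate:.1f} kB/s")
```

Filling in one such table per subsystem and adding them up would give the per-destination totals needed to plan the network.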
EH and FT (1)
Group Members
Doris Burkhardt (Online), Andre Bogaerts (Level2/DC),
Reiner Hauser (Level2/DC), Beniamino di Girolamo (DIG),
me (Level1)
Aims
Study error handling and fault tolerance in the TDAQ system,
including error prevention(!), reporting, recovery, etc.
Comment on (distributed) proposals coming from the Online group
Common classification of errors
Produce scenarios for handling errors
Identify single points of failure
Recommendations for producing a fault tolerant system
EH and FT (2)
L1Calo scenarios: prevention
Avoiding a system-wide firmware update catastrophe!
L1Calo scenarios: robustness
Parity on links, monitor errors. Do we have parity on links to CTP?
Single points of failure: CTP! System CMMs and their CTP links?
(Backup cabling to CTP from another crate?)
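To make "parity on links, monitor errors" concrete, here is a minimal sketch of a single even-parity bit per word with an error counter on the receiving side. It assumes one parity bit per transmitted word; the actual L1Calo link formats may differ, and the function names are illustrative only.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def send(word: int) -> tuple[int, int]:
    """Attach the parity bit to a data word before transmission."""
    return word, parity_bit(word)


def count_parity_errors(received) -> int:
    """Count received (word, parity) pairs whose parity does not match."""
    return sum(1 for word, parity in received if parity_bit(word) != parity)


if __name__ == "__main__":
    good = send(0b0110)
    word, parity = send(0b1011)
    corrupted = (word ^ 0b0001, parity)  # flip one data bit in transit
    print(count_parity_errors([good, corrupted]))
```

A single parity bit catches any odd number of flipped bits but misses an even number, so the counter is a monitor of link quality rather than a guarantee of data integrity.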
EH and FT (3)
L1Calo scenarios: reporting/recovery
Dead cells and bad links: report and disable; the effect is (typically) fairly small.
For links, could try resetting during a run?
Dead modules, crates: depending on severity, may only make sense to
reset outside a run?
Hot cells: may be in the trigger system or the calorimeter, may be
detected by PPM crate controller (rate histograms) and/or by EF
monitoring task. Crate controller can only disable the channel.
EF task can diagnose calo cell problem, disable calo cell and
enable PPM channel again? Potential problems with the same error
being recovered in different ways at different places.
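The hot cell detection from rate histograms could look something like the sketch below, which flags channels whose rate is far above the median of all channels. The factor of 10, the `min_rate` floor and the function name are assumptions for illustration, not an agreed algorithm.

```python
from statistics import median


def find_hot_channels(rates, factor=10.0, min_rate=1.0):
    """Return channels whose rate exceeds `factor` times the median rate.

    rates: mapping of channel id -> rate taken from a rate histogram.
    factor, min_rate: hypothetical tuning parameters; min_rate keeps the
    threshold above zero when most channels are silent.
    """
    if not rates:
        return []
    threshold = max(factor * median(rates.values()), min_rate)
    return sorted(channel for channel, rate in rates.items() if rate > threshold)


if __name__ == "__main__":
    rates = {"ch0": 100, "ch1": 110, "ch2": 95, "ch3": 5000}
    print(find_hot_channels(rates))
```

The same check could run in either the PPM crate controller or an EF monitoring task; only the action taken on a flagged channel would differ, which is where the two recovery paths could conflict.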
Databases
Activity
Lots of general discussion in TDAQ and ATLAS weeks
But not much activity of the working group itself...
Proposal for an ATLAS wide databases workshop early next year
Thorsten Wengler recently nominated to a new ATLAS-wide
role as database coordinator (not at a technical level)
Suggestion that L1Calo give a talk as a "typical" TDAQ system