
Working Groups

Murrough Landon - 17 December 2002
http://www.hep.ph.qmul.ac.uk/landon/talks

Overview

Monitoring (1)

Group Members

Francois Touchard (EF), Sergei Kolos (Online), Hans Peter Beck (DataCollection), Beniamino di Girolamo (DIG), me (Level1)

Aims

Study monitoring requirements
Monitoring matrix: data volume between monitoring sources and destinations to plan the network
Operational monitoring of the DAQ system
Operational and physics monitoring of the trigger
Produce monitoring chapter in the TDR

Monitoring (2)

Plans

L1Calo specific

Monitoring sources: PPMs, ROD crate controller, EF monitoring tasks (not ROS in final system?)
Monitoring destinations: ROD crate workstations, EF tasks combining histograms from many EF nodes, operator workstations displaying results, conditions database for some histograms and other monitoring data
Volume of traffic is not clear (to me)
L1Calo monitoring document needs another look in this context
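As a rough illustration of the monitoring matrix idea, the sketch below tabulates data rates between monitoring sources and destinations and sums the load arriving at each destination. The source and destination names come from the lists above; the pairings and all rates are hypothetical placeholders (the real volumes are not yet known), and the function names are invented for this example.

```python
from collections import defaultdict

# (source, destination, rate in kB/s).
# ASSUMPTION: pairings and rates are placeholders, not measurements or estimates.
FLOWS = [
    ("PPM", "ROD crate workstation", 10.0),
    ("ROD crate controller", "operator workstation", 5.0),
    ("EF monitoring task", "EF histogram combiner", 20.0),
    ("EF monitoring task", "conditions database", 1.0),
]


def build_matrix(flows):
    """Return a nested mapping: matrix[source][destination] -> summed rate."""
    matrix = defaultdict(lambda: defaultdict(float))
    for source, destination, rate in flows:
        matrix[source][destination] += rate
    return matrix


def totals_by_destination(matrix):
    """Sum the incoming rate at each destination across all sources."""
    totals = defaultdict(float)
    for row in matrix.values():
        for destination, rate in row.items():
            totals[destination] += rate
    return dict(totals)


if __name__ == "__main__":
    matrix = build_matrix(FLOWS)
    for destination, rate in sorted(totals_by_destination(matrix).items()):
        print(f"{destination}: {rate} kB/s")
```

Once real estimates exist, the per-destination totals would be the numbers relevant to planning the network.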

EH and FT (1)

Group Members

Doris Burkhardt (Online), Andre Bogaerts (Level2/DC), Reiner Hauser (Level2/DC), Beniamino di Girolamo (DIG), me (Level1)

Aims

Study error handling and fault tolerance in the TDAQ system, including error prevention(!), reporting, recovery, etc.
Comment on (distributed) proposals coming from the Online group
Common classification of errors
Produce scenarios for handling errors
Identify single points of failure
Recommendations for producing a fault tolerant system

EH and FT (2)

L1Calo scenarios: prevention

L1Calo scenarios: robustness

Parity on links, monitor errors. Do we have parity on links to CTP?
Single points of failure: CTP! System CMMs and their CTP links? (Backup cabling to CTP from another crate?)
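To make "parity on links, monitor errors" concrete, here is a minimal sketch of checking a parity bit per link word and counting the errors so they can be monitored. It assumes odd parity and a single parity bit per word; the actual convention and word format on any given L1Calo link are not stated here, and the class and function names are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Return the parity bit that makes the total number of 1s odd.

    ASSUMPTION: odd parity; the real link convention may differ.
    """
    ones = bin(word).count("1")
    return 0 if ones % 2 == 1 else 1


def check_word(word: int, received_parity: int) -> bool:
    """True if the received parity bit matches the one computed from the word."""
    return parity_bit(word) == received_parity


class LinkParityMonitor:
    """Count words and parity errors seen on one link (hypothetical helper)."""

    def __init__(self) -> None:
        self.words = 0
        self.errors = 0

    def record(self, word: int, received_parity: int) -> None:
        self.words += 1
        if not check_word(word, received_parity):
            self.errors += 1

    def error_fraction(self) -> float:
        return self.errors / self.words if self.words else 0.0


if __name__ == "__main__":
    monitor = LinkParityMonitor()
    monitor.record(0b1011, parity_bit(0b1011))  # good word
    monitor.record(0b1010, parity_bit(0b1011))  # one data bit flipped in transit
    print(monitor.errors, monitor.error_fraction())
```

A single parity bit only detects an odd number of flipped bits per word, so the error count is a lower bound on the true number of corrupted words.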

EH and FT (3)

L1Calo scenarios: reporting/recovery

Dead cells and bad links: report and disable; the effect is (typically) fairly small. For links, could we try resetting during a run?
Dead modules, crates: depending on severity, may only make sense to reset outside a run?
Hot cells: may be in the trigger system or the calorimeter, may be detected by PPM crate controller (rate histograms) and/or by EF monitoring task. Crate controller can only disable the channel. EF task can diagnose calo cell problem, disable calo cell and enable PPM channel again? Potential problems with the same error being recovered in different ways at different places.
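Hot cell detection from rate histograms could look something like the sketch below, which flags channels whose rate is far above the median of all channels. The median-based criterion, the factor of 10, the minimum-rate floor and the function name are all assumptions made for illustration, not the algorithm used by the PPM crate controller or the EF monitoring task.

```python
from statistics import median


def find_hot_channels(rates, factor=10.0, min_rate=1.0):
    """Return the sorted channel IDs whose rate exceeds factor * median rate.

    rates    -- mapping of channel ID to rate (e.g. one rate histogram)
    factor   -- ASSUMED multiple of the median above which a channel is "hot"
    min_rate -- ASSUMED floor on the median, so quiet systems do not flag noise
    """
    if not rates:
        return []
    threshold = factor * max(median(rates.values()), min_rate)
    return sorted(channel for channel, rate in rates.items() if rate > threshold)


if __name__ == "__main__":
    # Placeholder rates, not real data.
    rates = {0: 100.0, 1: 110.0, 2: 95.0, 3: 5000.0, 4: 105.0}
    print(find_hot_channels(rates))
```

A check like this only says that a channel is hot, not whether the fault lies in the trigger system or in the calorimeter, which is why the question of where the recovery decision is taken matters.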

Databases

Activity

Lots of general discussion in TDAQ and ATLAS weeks
But not much activity of the working group itself...
Proposal for an ATLAS wide databases workshop early next year
Thorsten Wengler recently nominated to a new ATLAS-wide role as database coordinator (not at a technical level)
Suggestion that L1Calo give a talk as a "typical" TDAQ system