Workshops and Working Groups
Murrough Landon - 6 March 2003
http://www.hep.ph.qmul.ac.uk/landon/talks
Overview
- Recent DIG meeting(?)
- Recent database workshop
- ROD crate DAQ workshop (26 March)
- Monitoring
- Error handling and fault tolerance
DIG Meeting
ROD contracts?
Database Workshop (1)
Survey of databases in ATLAS: TC
- Technical coordination: installation databases
- Will have full details of every component and subcomponent,
eg cables, modules, etc installed in the cavern and USA15
- legal requirement for decommissioning
- Defines pieces of equipment and functional locations
- Will track locations (movement of spares etc)
- ``equipment passports''
- Will (aim to) take copies of production databases or at least
provide some common access to them?
- Excel templates for bulk upload of data
- Not sure if small projects could use it as their own
production database?
- All to be presented via EDMS interface
Database Workshop (2)
Survey of databases in ATLAS: others
- DCS database
- Example detector database (muons)
- Example TDAQ database (l1calo)
- Online configuration
- HLT view
- Offline views (simulation, reconstruction, analysis)
Database Workshop (3)
Conditions database
- Presentations by detectors (LAr, Muon)
- LHCXX common project for conditions database
- ATLAS offline view (IoV layer over POOL/ROOT)
- Lisbon prototype
- GRID access
- Coherence of conditions and configuration data?
- Lots of discussion: not so sure about the convergence...
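To make the IoV (interval of validity) idea mentioned above concrete, here is a minimal sketch of conditions storage and lookup by validity interval. The ConditionsFolder class and all names are invented for illustration; this is not the real POOL/Lisbon API.

    import bisect

    class ConditionsFolder:
        """Toy interval-of-validity (IoV) store: each payload is valid
        from its 'since' time until the next stored 'since' time."""
        def __init__(self):
            self._since = []    # sorted validity start times
            self._payload = []  # payload stored at each start time

        def store(self, since, payload):
            i = bisect.bisect_left(self._since, since)
            self._since.insert(i, since)
            self._payload.insert(i, payload)

        def retrieve(self, time):
            i = bisect.bisect_right(self._since, time) - 1
            if i < 0:
                raise KeyError("no conditions valid at time %s" % time)
            return self._payload[i]

    # Example: per-run calibration constants keyed by run number
    folder = ConditionsFolder()
    folder.store(since=1000, payload={"ped_mean": 40.1})
    folder.store(since=1200, payload={"ped_mean": 40.4})
    print(folder.retrieve(1150))   # constants valid for runs 1000-1199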
Database Workshop (4)
L1Calo viewpoint
- Reprise of common level 1 requirements (web page prepared
by Thorsten Wengler last September and circulated to the at1soft list)
http://cern.ch/Atlas/GROUPS/DAQTRIG/LEVEL1/software/level1_databases.html
- Outline of L1Calo configuration data (with rough data sizes)
- calibrations: energy, pulse shape, timing
- crates, modules, cables, mapping to calorimeter channels
- hot and dead cell map
- firmware binaries (ideally not just pointers to files)
- multiple parallel versions required
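Purely as an illustration, the configuration data listed above (calibrations, mappings, hot/dead channel map, firmware) could be grouped into parallel named versions roughly as below. The class and field names are hypothetical, not the actual L1Calo schema.

    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class TowerCalibration:
        """Per-trigger-tower calibration constants (illustrative fields only)."""
        energy_scale: float
        timing_offset_ns: float
        disabled: bool = False            # hot/dead channel map entry

    @dataclass
    class ConfigurationVersion:
        """One complete, self-consistent L1Calo configuration.
        Several such versions (physics, calibration, test...) exist in parallel."""
        name: str
        towers: Dict[int, TowerCalibration] = field(default_factory=dict)
        cabling: Dict[str, str] = field(default_factory=dict)    # module/cable mapping
        firmware: Dict[str, bytes] = field(default_factory=dict) # binaries, not file paths

    configs = {
        "physics_v1": ConfigurationVersion(name="physics_v1"),
        "calib_timing": ConfigurationVersion(name="calib_timing"),
    }
    configs["physics_v1"].towers[0x0101] = TowerCalibration(1.02, 3.5)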
Database Workshop (5)
L1Calo viewpoint (continued)
- Conditions data (less well defined, especially numerically)
- store the complete conditions of each run
- cf common view that conditions DB should only hold
what is of interest to the offline
- data from monitoring (histograms, statistics etc)
- Feedback on present configuration database
Forthcoming ROD Crate DAQ Workshop
Ralf wants me to talk about our experience with the Online configuration database...
Monitoring (1)
Working group
- Members: Francois Touchard, Hanspeter Beck, Sergei Kolos,
Beniamino di Girolamo and me
- Aim to produce TDR chapter and backup document - continue
to worry about monitoring after the TDR too...?
- Draft backup document prepared (still at a fairly rough stage)
- TDR chapter will be distilled from this
- TDAQ network group are very keen to get numerical input
about the ``monitoring matrix'', ie traffic between various
sources and destinations of monitoring data
- Problem: detectors (and trigger groups) are busy and not
particularly concerned by the DAQ/HLT TDR timescale
- Monitoring workshop held during ATLAS week: some input
(mostly verbal, not talks) from detectors but no numbers
Monitoring (2)
L1Calo perspective
- Existing document from Eric
http://www.hep.ph.qmul.ac.uk/l1calo/doc/pdf/MonReqs.pdf
- Very detailed on some requirements, but says little about where monitoring is done or how much data is involved
- We should try to expand on this area
- And guess at some numbers...
Where to monitor
- The SFI/EF level and ROD crate DAQ will be the most useful places
to monitor during normal ATLAS physics running
- Monitoring via the ROS is needed for testing and
installation
- Some calibrations expected to be done at ROD crate level
(events monitored via ROD crate DAQ or maybe on board DSPs)
- Other calibrations will require dedicated EF tasks (or offline)
Monitoring: Trigger Decision
Check the level 1 trigger is functioning correctly
- For the calorimeter trigger, monitor the whole chain from
calorimeter analogue electronics,
through receiver stations, preprocessor, cluster and
jet/energy processors to the CTP
- Use full calorimeter readout, readout from the preprocessor
and later stages of the trigger pipeline
- Simulate the trigger functionality on a subset of events
and look for errors
- Full simulation needs the full event: ie via SFI or on
dedicated EF nodes
- Check both accepted events and sample of rejected
events (selection of suitable monitoring triggers
needs more thought)
- Partial simulation (of the digital trigger electronics
from the preprocessor onwards) could be done on event
fragments collected via ROD crate DAQ (if that supports
coherent samples across 2-3 ROD crates) [Livio: ``use ROS'']
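A sketch of the comparison step described above: re-run the trigger logic in software on a sample of events and flag disagreements with the hardware result. The event layout and the simulate_trigger and log_error helpers are stand-ins, not the real L1Calo simulation or message reporting interfaces.

    def check_trigger_decision(events, simulate_trigger, log_error):
        """Re-run the trigger logic in software on a sample of events and
        flag any disagreement with the hardware result in the readout."""
        mismatches = 0
        for event in events:
            hw_result = event["l1_result"]                       # from trigger pipeline readout
            sim_result = simulate_trigger(event["tower_data"])   # software re-calculation
            if sim_result != hw_result:
                mismatches += 1
                log_error("L1 mismatch in event %s: hw=%s sim=%s"
                          % (event["id"], hw_result, sim_result))
        return mismatches

    # Toy usage with stand-in data and a trivial 'simulation'
    events = [{"id": 1, "tower_data": [5, 40], "l1_result": True},
              {"id": 2, "tower_data": [3, 2],  "l1_result": True}]   # second should disagree
    count = check_trigger_decision(events,
                                   simulate_trigger=lambda towers: max(towers) > 10,
                                   log_error=print)
    print("mismatches:", count)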
Monitoring: Trigger Hardware
Monitor operational aspects of the trigger
- Check levels of crosstalk, and measure bit error rates and reliability
of links and backplanes, etc, using both hardware monitoring
via VME at O(1Hz) and event data
(some overlap with trigger decision monitoring, but not all
errors may affect the resulting L1A)
- Measure ``level 0'' rates and energy spectrum in each trigger
tower: preprocessor hardware histogram per tower read by
the crate CPUs (at 10Hz, but published less frequently)
- Rapid detection of new hot or dead cells, also indication
of real beam conditions
- Check current calibrations are still optimal
- Readout related aspects: buffer utilisation, etc
- Physical quantities (crate temperatures, voltages, etc)
via DCS
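A sketch of the kind of hot/dead tower check the per-tower rate histograms would allow; the threshold values and tower ids are placeholders, not agreed numbers.

    from statistics import median

    def flag_towers(rates, hot_factor=20.0, dead_rate=0.0):
        """Flag towers whose rate is far above the median (hot) or zero (dead).
        'rates' maps tower id -> counts seen in the last publishing interval."""
        typical = median(rates.values())
        hot  = [t for t, r in rates.items() if typical > 0 and r > hot_factor * typical]
        dead = [t for t, r in rates.items() if r <= dead_rate]
        return hot, dead

    rates = {0x0101: 12, 0x0102: 15, 0x0103: 900, 0x0104: 0}
    hot, dead = flag_towers(rates)
    print("hot:", hot, "dead:", dead)   # tower 0x0103 flagged hot, 0x0104 flagged dead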
Monitoring: Trigger Rates and Performance
Detailed monitoring of rates and deadtime
- Trigger rates through the system: primitives (internal calo/muon
details), CTP inputs, prescales, deadtime, final trigger ``items'', etc
- Rates per trigger tower (efficiency across eta-phi space)
- Correlations between trigger rates
- Correlations of trigger rates with beam conditions, etc
- History plots
- Etc...
Trigger performance
- Detailed studies will be mainly done offline, but some online
checks may also be useful?
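One example of a simple online check in this spirit: the overlap between two trigger items computed from their per-event accept decisions. The item names and event layout are invented for illustration.

    def item_correlation(events, item_a, item_b):
        """Fraction of events accepted by item_b among those accepted by item_a,
        a crude measure of overlap between two trigger items."""
        fired_a = [e for e in events if item_a in e["items"]]
        if not fired_a:
            return 0.0
        both = sum(1 for e in fired_a if item_b in e["items"])
        return both / len(fired_a)

    events = [{"items": {"EM25", "J60"}}, {"items": {"EM25"}}, {"items": {"J60"}}]
    print(item_correlation(events, "EM25", "J60"))   # 0.5 of EM25 events also fired J60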
Monitoring Matrix
Impact of monitoring on the network
- We will surely use all available resources for monitoring
- not sure what we absolutely need (and network bandwidth
may not be the limiting factor)
- Some expected traffic flows (no numbers yet):
- RODs (might have network interfaces?) to local workstations??
- crate CPUs to local workstations (hardware monitoring, ROD events)
- local workstations to central displays (histograms from local monitoring)
- events to EF tasks
- histograms/statistics from EF tasks to EF histogram collection
- various sources saving data to conditions database
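Even without numbers, the expected flows above can be written down as a source/destination matrix; a sketch with the bandwidth column deliberately left unfilled.

    # Monitoring matrix skeleton: (source, destination, data type, bandwidth).
    # Bandwidth left as None until real estimates exist.
    monitoring_flows = [
        ("ROD",               "local workstation",       "sampled events?",                  None),
        ("crate CPU",         "local workstation",       "hardware monitoring, ROD events",  None),
        ("local workstation", "central display",         "histograms",                       None),
        ("SFI",               "EF monitoring task",      "full events",                      None),
        ("EF task",           "EF histogram collection", "histograms/statistics",            None),
        ("various",           "conditions database",     "monitoring results",               None),
    ]

    for src, dst, what, bw in monitoring_flows:
        print("%-18s -> %-24s %s (bandwidth: %s)" % (src, dst, what, bw or "TBD"))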
Error Handling and Fault Tolerance (1)
Working group
- Members: Doris Burckhart, Andre Bogaerts, Reiner Hauser,
Beniamino di Girolamo and me
- Aim to produce TDR chapter and backup document
- Draft TDR chapter prepared (available via DAQ/HLT TDR page)
- Backup document will be an expansion of this
Error Handling and Fault Tolerance (2)
Topics covered
- Error message handling: transport and filtering (MRS)
- Classification of errors (FATAL, ERROR, WARNING, ``INFO'')
- also diagnostic and debug messages
- Handling of errors: distributed expert systems proposed
- Hierarchy probably matching that of the run control
- Handle errors as close to their origin as possible
(and if fixed, report success with an ``INFO'' message)
- If not handled, pass to higher level, which may need
to pass requests for actions back down again
- Fault tolerance: identify single points of failure,
make recommendations on achieving fault tolerance
within each subsystem (eg error in one component
should not affect neighbouring components)
- Try to collect error handling scenarios from subsystems
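A minimal sketch of the ``handle locally, escalate if unhandled'' idea in a run-control-like hierarchy; the class and message strings are invented for illustration and do not represent the MRS or expert system interfaces.

    class Controller:
        """One node in a run-control-like hierarchy.  Tries its local handlers
        first and passes the error to its parent only if none of them succeed."""
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent
            self.handlers = []     # callables returning True if the error was fixed

        def report(self, error):
            for handler in self.handlers:
                if handler(error):
                    print("INFO [%s]: recovered from '%s'" % (self.name, error))
                    return True
            if self.parent is not None:
                print("WARNING [%s]: escalating '%s'" % (self.name, error))
                return self.parent.report(error)
            print("ERROR [%s]: unhandled '%s'" % (self.name, error))
            return False

    root = Controller("L1Calo")
    pp_crate = Controller("PP_crate_1", parent=root)
    pp_crate.handlers.append(lambda err: err == "hot tower")  # local fix: mask the tower
    pp_crate.report("hot tower")    # handled locally, INFO issued
    pp_crate.report("link down")    # escalated to parent, then reported as ERROR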
Error Handling and Fault Tolerance (3)
L1Calo scenario: dealing with a hot cell
- Possible case for passing errors up and down different error
handling and run control hierarchies
- Hot cell detected (very quickly?) by PP rate monitoring: error message issued
- PP could suppress the tower automatically (Conditions DB update?)
- Perhaps also seen quickly by calo monitoring if it's a faulty
component?
- SFI/EF monitoring will (later) also see the hot cell in whole event data
- or maybe by watching for Conditions DB updates or PP error messages??
- Only SFI/EF monitoring can determine whether it's an individual calo cell
or the whole trigger tower
- If the former, the SFI/EF monitoring reports this and requests the PP system
to re-enable the tower (another Conditions DB update?)
- If calo monitoring has already suppressed the cell,
the later SFI/EF request can be ignored
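The decision flow of this scenario, sketched with stand-in objects for the PP, calorimeter monitoring and conditions database (all names are hypothetical).

    class Stub:
        """Stand-in for the PP, calorimeter and conditions DB interfaces."""
        def __init__(self): self.masked = set()
        def suppress(self, t): self.masked.add(t)
        def reenable(self, t): self.masked.discard(t)
        def already_suppressed(self, t): return t in self.masked
        def record(self, action, t): print("conditions DB:", action, hex(t))

    def handle_hot_tower(tower, pp, calo, cond_db, diagnosis):
        """Hot-tower scenario: PP suppresses immediately; the later SFI/EF
        diagnosis decides whether the tower can be re-enabled."""
        pp.suppress(tower)                   # automatic local action by the PP
        cond_db.record("suppressed", tower)
        if diagnosis == "single calo cell":  # from SFI/EF whole-event monitoring
            if calo.already_suppressed(tower):
                return "SFI/EF request ignored: calorimeter already masked the cell"
            pp.reenable(tower)               # trigger tower itself is healthy
            cond_db.record("re-enabled", tower)
            return "tower re-enabled"
        return "tower kept suppressed"

    pp, calo, cond_db = Stub(), Stub(), Stub()
    print(handle_hot_tower(0x0103, pp, calo, cond_db, diagnosis="single calo cell"))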
Murrough Landon (m.p.j.landon@qmul.ac.uk)