Site administration

Jobs

We collect around 100 different metrics on all jobs. They can be filtered and visualized in thousands of ways, but for site admins we have prepared one dashboard with the most important information:

Keep in mind that this dashboard contains information on finished jobs only. For jobs that have not yet finished, one needs to look here. To show only information from one site, add a term like this to the search bar: sitename:MWT2. We suggest grouping nodes (based on CPU, storage, or network connectivity) and comparing their performance.
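If you want to run the same per-site filter outside Kibana, the search-bar term can be wrapped in an Elasticsearch `query_string` query. The following is only a sketch: the `sitename` field is the one used in the searches in this section, while the helper name `site_query_body` is made up for illustration, and the index to send the request to is not stated here, so it is left out.

```python
import json


def site_query_body(site):
    """Build an Elasticsearch request body that keeps only one site.

    Assumption: the documents carry a `sitename` field, as in the
    search-bar examples. The function name is hypothetical.
    """
    if not site or any(ch.isspace() for ch in site):
        raise ValueError("site name must be a single non-empty token")
    return {"query": {"query_string": {"query": f"sitename:{site}"}}}


if __name__ == "__main__":
    # Print the JSON body that would be sent to the search endpoint.
    print(json.dumps(site_query_body("MWT2"), indent=2))
```

The body can be pasted into any Elasticsearch client; only the site name changes between sites.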

Jobs IO

Site movers collect information on all input/output data transfers made by a job. This data is collected and indexed in ES at both CERN and UChicago. It covers every file the job accessed or wrote, file sizes, rates, and a lot of metadata (filenames, worker node, etc.). Sites can use this data to:

A good starting point is this Kibana dashboard. To show only information from one site, add a term like this to the search bar: sitename:MWT2. First look at the visualization named errors. Nodes with more than a few errors most probably need a closer inspection.
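The same check can be done on exported records. The sketch below counts errors per node and lists the nodes above a threshold. It assumes each record is a dict with a `hostname` key and a truthy `error` key when the transfer failed; the `error` key, the function name, and the default threshold of 3 (standing in for "more than a few") are illustrative assumptions, not part of the indexed schema.

```python
from collections import Counter


def nodes_with_many_errors(records, threshold=3):
    """Return (hostname, error_count) pairs for nodes with more than
    `threshold` errors, worst node first.

    Assumption: `records` is an iterable of dicts with a `hostname` key
    and an optional truthy `error` key (hypothetical field name).
    """
    counts = Counter(r["hostname"] for r in records if r.get("error"))
    flagged = [(host, n) for host, n in counts.items() if n > threshold]
    # Sort by descending error count, then by hostname for stable output.
    return sorted(flagged, key=lambda item: (-item[1], item[0]))
```

Adjust `threshold` to whatever "a few" means for the size of your site.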

For a deeper investigation we suggest looking separately at WNs connected to the same switch. The easiest way is a search like this: sitename:MWT2 AND hostname:uct2*. A big difference in performance can point to a problem with the switch the WNs are connected to, or with the link from that switch to the storage. Feel free to make a special visualization that splits all your WNs into groups and save it (please prepend the name of your site to the names of your visualizations) so you don't have to redo it every day. You can also save a copy of the dashboard and customize it to your liking.
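As a rough illustration of the group comparison, the sketch below assigns WNs to groups by hostname wildcard (in the spirit of hostname:uct2*) and compares the median rate per group. The record layout (`hostname` and `rate` keys), the function names, and the 0.5 cut-off are assumptions made for this example; substitute the fields and thresholds that match your own data.

```python
from fnmatch import fnmatchcase
from statistics import median


def median_rate_by_group(records, groups):
    """Median rate per group of worker nodes.

    `groups` maps a group name to a hostname wildcard such as "uct2*".
    A record goes to the first group whose pattern matches its hostname;
    records that match no pattern are ignored.
    """
    rates = {name: [] for name in groups}
    for rec in records:
        for name, pattern in groups.items():
            if fnmatchcase(rec["hostname"], pattern):
                rates[name].append(rec["rate"])
                break
    # Groups with no matching records are left out of the result.
    return {name: median(vals) for name, vals in rates.items() if vals}


def slow_groups(medians, fraction=0.5):
    """Names of groups whose median is below `fraction` of the best group."""
    if not medians:
        return []
    best = max(medians.values())
    return sorted(name for name, m in medians.items() if m < fraction * best)
```

A group that shows up in `slow_groups` is a candidate for checking the switch or the link from that switch to the storage, as described above.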

Wide Area Network issues

It is very important to quickly spot WAN connectivity issues. There are two easy ways to do it: