Introduction
------------------------------------------------------------

dtquery is a CGI-based tool to query your mon downtime logs for specific
downtime events, on specific hosts/groups/services, during specified date
ranges, and to supply you with graphs summarizing the results.

Downtime can also be queried on a per-host basis, even though mon doesn't
support the feature officially. When most services fail, the monitor which
detected the failure writes the names of the failed hosts into the summary
line, and the summary field is recorded as part of the downtime log. So when
we search for "hosts", what we are actually doing is searching for "text in
the summary field", but in most configurations these are identical.

dtquery was developed so that we could analyze our downtime records more
effectively, and more easily answer questions like:

1) When are certain types of failures typically occurring? Are there
   time-of-day/week/month patterns?
2) Are certain hosts within hostgroups more vulnerable to failures than
   others?
3) Are certain services within hostgroups more vulnerable to failures than
   others?
4) When failures happen, how long do they last, and what does the
   distribution of failure times look like?
5) Why should we use mon, rather than replace it with another open-source or
   commercial monitoring package that has more graphing/reporting features?

dtquery was developed and tested on Solaris 7 (sparc). It should work on any
UNIX that supports the underlying software (mon, perl, gd, gnuplot);
basically, any system that can run mon and mon.cgi should also be capable of
running dtquery.

Installation Instructions
------------------------------------------------------------

1. You must have a working mon installation that is generating downtime logs.
   See the mon documentation for how to set this up if you haven't already
   (you must specify 'dtlogging = yes' and 'dtlogfile = /path/to/dtlogfile'
   in your mon.cf file). You will also need a reasonable amount of downtime
   data in your logs in order for this tool to generate significant value and
   produce meaningful graphs. Download mon at:

       ftp://ftp.kernel.org/pub/software/admin/mon/

2. Although it's not strictly required, you should also install a new version
   of mon.cgi, which is integrated into dtquery. mon.cgi is available from
   the same location where you got mon, and includes installation
   instructions:

       http://www.nam-shub.com/files/

3. Install zlib and libpng on your system, if they are not already available.
   You may need to build shared libraries on some architectures; we needed to
   do so for libpng. Both libpng and zlib are available at:

       ftp://ftp.uu.net/graphics/png

4. Install a png-capable version of gd on your system. This includes all
   recent versions of gd; we used v1.8.3 and that is what we recommend. The
   gd graphics library is available from:

       http://www.boutell.com/gd/

5. Make sure the requisite perl modules are installed. These modules are all
   available from CPAN (http://www.cpan.org). You will need:

       Mon::Client
       Statistics::Descriptive
       GD::Graph (requires GD, we used v1.4)

   The remaining required modules (CGI, Time::Local, Carp) all come with a
   standard perl5 build, but are also available on CPAN.
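
   As a quick sanity check (this command is our suggestion, not something
   shipped with dtquery), you can ask perl to load each required module from
   the command line. Any missing module will produce a "Can't locate ..."
   error instead of the final message:

       # perl -MMon::Client -MStatistics::Descriptive -MGD::Graph -MGD \
              -MCGI -MTime::Local -MCarp -e 'print "all modules found\n"'
       all modules found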

6. Make sure gnuplot is installed. gnuplot is available from
   http://www.gnuplot.org/. We used v3.7.1, the latest version available at
   the time of release. Make sure you build gnuplot with png support (use the
   "--with-png" option during configure).

7. Test gnuplot to verify that it is properly installed and can output png
   files properly:

       # gnuplot
       gnuplot> set output '/tmp/test.png'
       gnuplot> set term png
       Terminal type set to 'png'
       Options are 'small monochrome'
       gnuplot> plot sin(x)
       gnuplot> exit

   Now view the resulting image in a web browser or image-viewing program to
   verify that the image was generated and that it looks like a sine wave.

8. Copy the dtquery.cgi script into your webserver's cgi-bin directory. If
   you are running Apache, DO NOT RUN THIS SCRIPT UNDER mod_perl, or else
   YOU WILL SUFFER SEVERE PERFORMANCE PENALTIES! This is because dtquery.cgi
   forks off external gnuplot processes, which under mod_perl means that you
   are actually forking off an entire httpd process to accomplish each fork.
   Please see the following URL for more information about why mod_perl is a
   bad idea for dtquery.cgi:

       http://perl.apache.org/guide/performance.html#Forking_and_Executing_Subprocess

9. Edit the header portion of dtquery.cgi to reflect your mon configuration
   and your organization's defaults. If you're impatient, you can probably
   leave most of the settings alone, as they are set to reasonable defaults.
   Note that by default, dtquery is set up to query a live mon server for
   downtime log data ($main::dtlog_source set to "mon"). You may very well
   wish to keep the downtime log information on a separate machine; in that
   case, use the "files" option for $main::dtlog_source.
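
   As a rough illustration only (the exact surrounding code and defaults live
   in the header of your copy of dtquery.cgi; the comments here are ours),
   the setting looks something like this:

       # query a live mon server for downtime log data (the default)
       $main::dtlog_source = "mon";

       # ...or read the downtime logs from flat files instead
       # $main::dtlog_source = "files";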

Usage Instructions
------------------------------------------------------------

1. You will need a browser that supports Javascript. Netscape 4 and IE5 were
   both tested and work.

2. Open the dtquery web page in your browser. The page may take a few moments
   to load, since it actually makes a query to the mon server you specified
   and retrieves all current groups, services, and hosts. Select your query
   criteria and go!

From there, hopefully everything should be obvious. If it's not, then we did
something wrong. Let us know how we can improve!

What You Can Learn About Your Downtime From The Graphs
------------------------------------------------------------

You can learn a lot about your downtime from the graphs generated. You can
also learn nothing: there is no guarantee that trends will be apparent, or
that apparent trends are actual trends and not just coincidences. Asking the
right questions of your data is not always an easy task, and sometimes there
are just no clear answers. Trends will not always pop out at you. To help you
get more out of the graphs, here are some hints.

* Downtime by Hour of Day - Also known as "The Bar Code Graph", this graph
  shows a binary representation of the state of the service. Red means
  failure, white means OK. Unless your timeframe is very small (1-2 days), or
  you have very few failures, it may be hard to get much out of this graph.
  But it is very good at showing you date ranges to dig deeper within.

* Cumulative Downtime by Time of Day - This graph answers the question "What
  time of day is this host/group/service spending the most time in the
  failure state?"

* Cumulative Downtime by Day of Week - This graph answers the question "What
  day of the week is this host/group/service spending the most time in the
  failure state?"

* Failure Time Distribution - This graph shows you the exact distribution of
  your failure times, in minutes, on a logarithmic scale. It answers the
  questions "Are most of my failures short? Long? Is there a discernible
  pattern?"

* Cumulative Downtime by Service - This graph answers the question "For a
  given group or groups, how is my downtime distributed among the various
  services?" For example, how much time has your HTTP service spent failed
  relative to the ping service?

* Cumulative Downtime by Group - This graph answers the question "For a given
  service or services, how is my downtime distributed among different
  hostgroups?" For example, which groups have spent the most minutes in the
  failure state?

Performance
------------------------------------------------------------

dtquery was not designed as a super high-performance application. It reads in
downtime logs, which are flat text files. We have tested the application in
development with large datasets (a 13000+ event downtime log) and it performs
acceptably on a 333MHz Sparc Ultra-5 for most queries. The main factor in the
running time is the number of results returned. Searching a 13000-event
logfile, getting 600 matches, and operating on those is not a big deal (2-3
seconds), but returning all 13000 events will force dtquery to work hard for
a good 30-40 seconds, most of which is probably spent sorting.

We haven't done any performance tuning or profiling of the code, so there are
probably significant opportunities for performance improvements. We felt that
for data sets significantly larger than what we are dealing with now, a
database would definitely be the way to go.

Ongoing Maintenance
------------------------------------------------------------

One feature of the initial version of dtquery is that it does not clean up
the graphs that it creates. This is so you can make graphs and then send the
display URL to a colleague or post it on a web page. As things currently
stand, the URL needed to regenerate a graph would be very long, might not fit
in a GET request, and would certainly be ugly.

The current way to deal with this is to implement a cron job that
periodically cleans up the graph directory. For example:

    # remove all files in the dtquery graph directory that have not been
    # accessed for 14 days or more.
    0 2 * * * /bin/find /tmp/dtquery-cache -type f -atime +14 -exec /bin/rm {} \;

In the future, we might:

1) Implement a cleanup job as part of dtquery itself. This adds a small
   processing overhead to each request, but keeps the cache relatively clean.
2) Stick with implementing cleanup as a cron job.
3) Shorten the query parameters, and allow the whole query to be embedded in
   the URL in a reasonable way. The tradeoff is disk storage vs. CPU
   utilization to regenerate the graphs.

Credits
------------------------------------------------------------

The initial version of dtquery, including all the Javascript and HTML coding,
the query engine, and the presentation logic, was done by a colleague who
wishes to remain anonymous; without her efforts, dtquery would not exist in
any form. Graphing capabilities were added by Andrew Ryan (andrewr@mycfo.com).
"Code reuse" was done from Cricket (png spraying routines), Chart::Graph
(gnuplot running routines), mon, mon.cgi, and our own internal trouble
ticketing system.