Gauge Columns

From zzat
Jump to: navigation, search


The gauge column is how we define the thresholds for zztat to use when deciding what's "normal" and what constitutes a situation demanding a warning or critical reaction.

Data Types

Due to the nature of how zztat implements gauge columns, the data type used must be specified and strictly adhered to. Three data types are currently supported for gauge columns: NUMBER, VARCHAR2 and DATE. When writing your gauge query, you can decide which data type your gauge columns will be using. The columns then have to be defined according to the gauge query. Note that zztat does not auto-detect this, but this may change in future versions.

Rendering

When zztat is rendering your base gauge query towards its final form that will ultimately be executed it will inject the gauge columns as part of the rendering. For example, assume the following gauge base query:

select z.begin_time, z.snap#, s.username, w.sid, w.event, round(count(*) over (partition by w.event) / count(*) over (partition by null),2)*100 as pct_sess,
       case when substr(w.event, 1,6) = 'latch:' then 1 else 0 end as is_latch
  from sys.v_$session s, zz$ses_wait w, zz$snap z
 where s.sid = w.sid
   and w.snap# = z.snap#
   and s.username is not null

zztat will then wrap the query as follows to add the gauge columns:

select * from (
select z.begin_time, z.snap#, s.username, w.sid, w.event, round(count(*) over (partition by w.event) / count(*) over (partition by null),2)*100 as pct_sess,
       case when substr(w.event, 1,6) = 'latch:' then 1 else 0 end as is_latch
  from sys.v_$session s, zz$ses_wait w, zz$snap z
 where s.sid = w.sid
   and w.snap# = z.snap#
   and s.username is not null
) where /* gauge col 1 */PCT_SESS > :ZZ$BCPCT_SESS1 and /* gauge col 2 */IS_LATCH = :ZZ$BCIS_LATCH2


The process of rendering the query takes place in a fully automatic fashion and ensures that gauges run at optimal performance. The query is then parsed and executed via dbms_sql.

Using this approach has the advantage that we're using Oracle SQL to process the data - which is almost always faster than doing it procedurally. To actually produce alerts and then trigger reactions, zztat also loops over the records, but this only happens after everything has already been filtered out and is a very small row count.

Columns and Informational Columns

A gauge column definition consists of a set of two columns, both of which must be valid columns in the gauge base query. The gauge column name itself defines the value we evaluate our thresholds against. The informational column is a helper column used when alerting or processing reactions.

Let's take a simple example:

begin
zz$manage.gauge_col_add
(
  gauge                 => 'TS_SPACE',
  column_name           => 'PCT_FREE',
  informational_column  => 'TS_NAME',
  column_operator       => zz$manage.OPER_LT,
  warn_limit_numeric    => 25,
  warn_message          => 'Tablespace ##INFO## has less than 25% space available',
  crit_limit_numeric    => 10,
  crit_message          => 'Tablespace ##INFO## has less than 10% space available'
);
end;
/

In the above gauge column definition, the gauge column is the PCT_FREE column, which returns the available space in the tablespace as a percentage. This is what we want to measure in this particular gauge. The informational column is then set to the TS_NAME, allowing us to produce meaningful messages by using the tablespace name.

We can further see the aforementioned data types in action - this gauge query selects the PCT_FREE value, which is a number. We therefore use the warn_limit_numeric and crit_limit_numeric parameters.

Message substitutions

The gauge column definition must always include a message that corresponds to the severity of the threshold - either warning or critical. These messages accept several substitution strings which can be embedded in the message text, and will be replaced by zztat automatically whenever the message is used:

Substitution Text Will be replaced with ...
##INFO## The value of the informational column as returned by the gauge query
##ROW## The value of the gauge column as returned by the gauge query (the actual row value we evaluated against the threshold)
##WARN_LIMIT## The warning threshold limit that was violated
##CRIT_LIMIT## The critical threshold limit that was violated
##DB_NAME## The name of the database the gauge ran on (refers to v$database.name)
##DB_UNIQUE_NAME## The unique name of the database the gauge ran on (refers to v$database.db_unique_name)
##HOST_NAME## The host name of the database where the gauge ran on (refers to v$instance.host_name)

Note that if you require multiple values to be passed on to your reaction for further filtering or processing, you could do so for example by producing a JSON string and returning that string as the informational column, or by producing a delimited concatenated string.

Column Operators

A proper value evaluation wouldn't be possible (or flexible) without multiple different operators. The gauge column definition must always include an operator. The following operators are available to zztat in its current version (this list may grow in future versions):

Operator SQL Expression Condition required to trigger an alert Supports Overrides
zz$manage.OPER_EQ = Row value of the gauge column must be equal to the warning or critical limit No
zz$manage.OPER_NE != Row value of the gauge column must not be equal to the warning or critical limit No
zz$manage.OPER_BW between Row value of the gauge column must be between the warning and critical limit No
zz$manage.OPER_NB not between Row value of the gauge column not be between the warning and critical limit No
zz$manage.OPER_LT < Row value of the gauge column must be less than the warning or critical limit Yes
zz$manage.OPER_GT > Row value of the gauge column must be greater than the warning or critical limit Yes

Note that you may directly use the numeric values of those operators, but it is recommended to use the provided constants in zz$manage, such as zz$manage.OPER_LT.

Example

Below is an example gauge monitoring ASM diskgroups and alerting when either the percentage free or the total usable mb free falls below a certain threshold:

begin
zz$manage.gauge_add
(
  name                  => 'ASM_DG_FREE',
  metric                => 'ASM_DG',
  query                 => q'[
select s.begin_time, d.snap#, d.name, d.usable_file_mb, 100-(d.usable_file_mb/d.total_mb*100) as usable_pct
  from zz$asm_dg d, zz$snap s
 where d.zz$db = s.zz$db
   and d.snap# = s.snap#
   and s.metric = 'ASM_DG'
]',
  comments              => 'Gauge free space in ASM diskgroups',
  warn_reaction         => 'ALERT_LOG_AND_EMAIL',
  warn_message          => '##HOST##: Diskgroup ##INFO## is starting to run low on space',
  crit_reaction         => 'ALERT_LOG_AND_EMAIL',
  crit_message          => '##HOST##: Diskgroup ##INFO## is running critically low on space',
  flags                 => zz$manage.AUTOSYNC
);
end;
/

begin
zz$manage.gauge_col_add
(
  gauge                 => 'ASM_DG_FREE',
  column_name           => 'USABLE_PCT',
  informational_column  => 'NAME',
  column_operator       => zz$manage.OPER_LT,
  warn_limit_numeric    => 20,
  warn_message          => 'Diskgroup ##INFO## has only ##ROW##% free space',
  crit_limit_numeric    => 10,
  crit_message          => 'Diskgroup ##INFO## has only ##ROW##% free space'
);
end;
/

begin
zz$manage.gauge_col_add
(
  gauge                 => 'ASM_DG_FREE',
  column_name           => 'USABLE_FILE_MB',
  informational_column  => 'NAME',
  column_operator       => zz$manage.OPER_LT,
  warn_limit_numeric    => 100000,
  warn_message          => 'Diskgroup ##INFO## has less than 100GB free space ( ##ROW## MB available )',
  crit_limit_numeric    => 50000,
  crit_message          => 'Diskgroup ##INFO## has less than 50GB free space ( ##ROW## MB available )'
);
end;
/
For more information on the various gauge procedures, please see the documentation on the zz$manage package.