Histogram Design Notes

Author

Tom Zanussi <zanussi@kernel.org>

This document attempts to provide a description of how the ftrace histograms work and how the individual pieces map to the data structures used to implement them in trace_events_hist.c and tracing_map.c.

Note: All the ftrace histogram command examples assume the working

directory is the ftrace /tracing directory. For example:

# cd /sys/kernel/debug/tracing

Also, the histogram output displayed for those commands will be generally be truncated - only enough to make the point is displayed.

‘hist_debug’ trace event files

If the kernel is compiled with CONFIG_HIST_TRIGGERS_DEBUG set, an event file named ‘hist_debug’ will appear in each event’s subdirectory. This file can be read at any time and will display some of the hist trigger internals described in this document. Specific examples and output will be described in test cases below.

Basic histograms

First, basic histograms. Below is pretty much the simplest thing you can do with histograms - create one with a single key on a single event and cat the output:

# echo 'hist:keys=pid' >> events/sched/sched_waking/trigger

# cat events/sched/sched_waking/hist

{ pid:      18249 } hitcount:          1
{ pid:      13399 } hitcount:          1
{ pid:      17973 } hitcount:          1
{ pid:      12572 } hitcount:          1
...
{ pid:         10 } hitcount:        921
{ pid:      18255 } hitcount:       1444
{ pid:      25526 } hitcount:       2055
{ pid:       5257 } hitcount:       2055
{ pid:      27367 } hitcount:       2055
{ pid:       1728 } hitcount:       2161

Totals:
  Hits: 21305
  Entries: 183
  Dropped: 0

What this does is create a histogram on the sched_waking event using pid as a key and with a single value, hitcount, which even if not explicitly specified, exists for every histogram regardless.

The hitcount value is a per-bucket value that’s automatically incremented on every hit for the given key, which in this case is the pid.

So in this histogram, there’s a separate bucket for each pid, and each bucket contains a value for that bucket, counting the number of times sched_waking was called for that pid.

Each histogram is represented by a hist_data struct.

To keep track of each key and value field in the histogram, hist_data keeps an array of these fields named fields[]. The fields[] array is an array containing struct hist_field representations of each histogram val and key in the histogram (variables are also included here, but are discussed later). So for the above histogram we have one key and one value; in this case the one value is the hitcount value, which all histograms have, regardless of whether they define that value or not, which the above histogram does not.

Each struct hist_field contains a pointer to the ftrace_event_field from the event’s trace_event_file along with various bits related to that such as the size, offset, type, and a hist_field_fn_t function, which is used to grab the field’s data from the ftrace event buffer (in most cases - some hist_fields such as hitcount don’t directly map to an event field in the trace buffer - in these cases the function implementation gets its value from somewhere else). The flags field indicates which type of field it is - key, value, variable, variable reference, etc., with value being the default.

The other important hist_data data structure in addition to the fields[] array is the tracing_map instance created for the histogram, which is held in the .map member. The tracing_map implements the lock-free hash table used to implement histograms (see kernel/trace/tracing_map.h for much more discussion about the low-level data structures implementing the tracing_map). For the purposes of this discussion, the tracing_map contains a number of buckets, each bucket corresponding to a particular tracing_map_elt object hashed by a given histogram key.

Below is a diagram the first part of which describes the hist_data and associated key and value fields for the histogram described above. As you can see, there are two fields in the fields array, one val field for the hitcount and one key field for the pid key.

Below that is a diagram of a run-time snapshot of what the tracing_map might look like for a given run. It attempts to show the relationships between the hist_data fields and the tracing_map elements for a couple hypothetical keys and values.:

+------------------+
| hist_data        |
+------------------+     +----------------+
  | .fields[]      |---->| val = hitcount |----------------------------+
  +----------------+     +----------------+                            |
  | .map           |       | .size        |                            |
  +----------------+       +--------------+                            |
                           | .offset      |                            |
                           +--------------+                            |
                           | .fn()        |                            |
                           +--------------+                            |
                                 .                                     |
                                 .                                     |
                                 .                                     |
                         +----------------+ <--- n_vals                |
                         | key = pid      |----------------------------|--+
                         +----------------+                            |  |
                           | .size        |                            |  |
                           +--------------+                            |  |
                           | .offset      |                            |  |
                           +--------------+                            |  |
                           | .fn()        |                            |  |
                         +----------------+ <--- n_fields              |  |
                         | unused         |                            |  |
                         +----------------+                            |  |
                           |              |                            |  |
                           +--------------+                            |  |
                           |              |                            |  |
                           +--------------+                            |  |
                           |              |                            |  |
                           +--------------+                            |  |
                                          n_keys = n_fields - n_vals   |  |

The hist_data n_vals and n_fields delineate the extent of the fields[] | | array and separate keys from values for the rest of the code. | |

Below is a run-time representation of the tracing_map part of the | | histogram, with pointers from various parts of the fields[] array | | to corresponding parts of the tracing_map. | |

The tracing_map consists of an array of tracing_map_entrys and a set | | of preallocated tracing_map_elts (abbreviated below as map_entry and | | map_elt). The total number of map_entrys in the hist_data.map array = | | map->max_elts (actually map->map_size but only max_elts of those are | | used. This is a property required by the map_insert() algorithm). | |

If a map_entry is unused, meaning no key has yet hashed into it, its | | .key value is 0 and its .val pointer is NULL. Once a map_entry has | | been claimed, the .key value contains the key’s hash value and the | | .val member points to a map_elt containing the full key and an entry | | for each key or value in the map_elt.fields[] array. There is an | | entry in the map_elt.fields[] array corresponding to each hist_field | | in the histogram, and this is where the continually aggregated sums | | corresponding to each histogram value are kept. | |

The diagram attempts to show the relationship between the | | hist_data.fields[] and the map_elt.fields[] with the links drawn | | between diagrams:

+-----------+                                                          |  |
| hist_data |                                                          |  |
+-----------+                                                          |  |
  | .fields |                                                          |  |
  +---------+     +-----------+                                        |  |
  | .map    |---->| map_entry |                                        |  |
  +---------+     +-----------+                                        |  |
                    | .key    |---> 0                                  |  |
                    +---------+                                        |  |
                    | .val    |---> NULL                               |  |
                  +-----------+                                        |  |
                  | map_entry |                                        |  |
                  +-----------+                                        |  |
                    | .key    |---> pid = 999                          |  |
                    +---------+    +-----------+                       |  |
                    | .val    |--->| map_elt   |                       |  |
                    +---------+    +-----------+                       |  |
                         .           | .key    |---> full key *        |  |
                         .           +---------+    +---------------+  |  |
                         .           | .fields |--->| .sum (val)    |<-+  |
                  +-----------+      +---------+    | 2345          |  |  |
                  | map_entry |                     +---------------+  |  |
                  +-----------+                     | .offset (key) |<----+
                    | .key    |---> 0               | 0             |  |  |
                    +---------+                     +---------------+  |  |
                    | .val    |---> NULL                    .          |  |
                  +-----------+                             .          |  |
                  | map_entry |                             .          |  |
                  +-----------+                     +---------------+  |  |
                    | .key    |                     | .sum (val) or |  |  |
                    +---------+    +---------+      | .offset (key) |  |  |
                    | .val    |--->| map_elt |      +---------------+  |  |
                  +-----------+    +---------+      | .sum (val) or |  |  |
                  | map_entry |                     | .offset (key) |  |  |
                  +-----------+                     +---------------+  |  |
                    | .key    |---> pid = 4444                         |  |
                    +---------+    +-----------+                       |  |
                    | .val    |    | map_elt   |                       |  |
                    +---------+    +-----------+                       |  |
                                     | .key    |---> full key *        |  |
                                     +---------+    +---------------+  |  |
                                     | .fields |--->| .sum (val)    |<-+  |
                                     +---------+    | 65523         |     |
                                                    +---------------+     |
                                                    | .offset (key) |<----+
                                                    | 0             |
                                                    +---------------+
                                                            .
                                                            .
                                                            .
                                                    +---------------+
                                                    | .sum (val) or |
                                                    | .offset (key) |
                                                    +---------------+
                                                    | .sum (val) or |
                                                    | .offset (key) |
                                                    +---------------+

Abbreviations used in the diagrams:

hist_data = struct hist_trigger_data
hist_data.fields = struct hist_field
fn = hist_field_fn_t
map_entry = struct tracing_map_entry
map_elt = struct tracing_map_elt
map_elt.fields = struct tracing_map_field

Whenever a new event occurs and it has a hist trigger associated with it, event_hist_trigger() is called. event_hist_trigger() first deals with the key: for each subkey in the key (in the above example, there is just one subkey corresponding to pid), the hist_field that represents that subkey is retrieved from hist_data.fields[] and the hist_field_fn_t fn() associated with that field, along with the field’s size and offset, is used to grab that subkey’s data from the current trace record.

Once the complete key has been retrieved, it’s used to look that key up in the tracing_map. If there’s no tracing_map_elt associated with that key, an empty one is claimed and inserted in the map for the new key. In either case, the tracing_map_elt associated with that key is returned.

Once a tracing_map_elt available, hist_trigger_elt_update() is called. As the name implies, this updates the element, which basically means updating the element’s fields. There’s a tracing_map_field associated with each key and value in the histogram, and each of these correspond to the key and value hist_fields created when the histogram was created. hist_trigger_elt_update() goes through each value hist_field and, as for the keys, uses the hist_field’s fn() and size and offset to grab the field’s value from the current trace record. Once it has that value, it simply adds that value to that field’s continually-updated tracing_map_field.sum member. Some hist_field fn()s, such as for the hitcount, don’t actually grab anything from the trace record (the hitcount fn() just increments the counter sum by 1), but the idea is the same.

Once all the values have been updated, hist_trigger_elt_update() is done and returns. Note that there are also tracing_map_fields for each subkey in the key, but hist_trigger_elt_update() doesn’t look at them or update anything - those exist only for sorting, which can happen later.

Basic histogram test

This is a good example to try. It produces 3 value fields and 2 key fields in the output:

# echo 'hist:keys=common_pid,call_site.sym:values=bytes_req,bytes_alloc,hitcount' >> events/kmem/kmalloc/trigger

To see the debug data, cat the kmem/kmalloc’s ‘hist_debug’ file. It will show the trigger info of the histogram it corresponds to, along with the address of the hist_data associated with the histogram, which will become useful in later examples. It then displays the number of total hist_fields associated with the histogram along with a count of how many of those correspond to keys and how many correspond to values.

It then goes on to display details for each field, including the field’s flags and the position of each field in the hist_data’s fields[] array, which is useful information for verifying that things internally appear correct or not, and which again will become even more useful in further examples:

# cat events/kmem/kmalloc/hist_debug

# event histogram
#
# trigger info: hist:keys=common_pid,call_site.sym:vals=hitcount,bytes_req,bytes_alloc:sort=hitcount:size=2048 [active]
#

hist_data: 000000005e48c9a5

n_vals: 3
n_keys: 2
n_fields: 5

val fields:

  hist_data->fields[0]:
    flags:
      VAL: HIST_FIELD_FL_HITCOUNT
    type: u64
    size: 8
    is_signed: 0

  hist_data->fields[1]:
    flags:
      VAL: normal u64 value
    ftrace_event_field name: bytes_req
    type: size_t
    size: 8
    is_signed: 0

  hist_data->fields[2]:
    flags:
      VAL: normal u64 value
    ftrace_event_field name: bytes_alloc
    type: size_t
    size: 8
    is_signed: 0

key fields:

  hist_data->fields[3]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: common_pid
    type: int
    size: 8
    is_signed: 1

  hist_data->fields[4]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: call_site
    type: unsigned long
    size: 8
    is_signed: 0

The commands below can be used to clean things up for the next test:

# echo '!hist:keys=common_pid,call_site.sym:values=bytes_req,bytes_alloc,hitcount' >> events/kmem/kmalloc/trigger

Variables

Variables allow data from one hist trigger to be saved by one hist trigger and retrieved by another hist trigger. For example, a trigger on the sched_waking event can capture a timestamp for a particular pid, and later a sched_switch event that switches to that pid event can grab the timestamp and use it to calculate a time delta between the two events:

# echo 'hist:keys=pid:ts0=common_timestamp.usecs' >>
        events/sched/sched_waking/trigger

# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >>
        events/sched/sched_switch/trigger

In terms of the histogram data structures, variables are implemented as another type of hist_field and for a given hist trigger are added to the hist_data.fields[] array just after all the val fields. To distinguish them from the existing key and val fields, they’re given a new flag type, HIST_FIELD_FL_VAR (abbreviated FL_VAR) and they also make use of a new .var.idx field member in struct hist_field, which maps them to an index in a new map_elt.vars[] array added to the map_elt specifically designed to store and retrieve variable values. The diagram below shows those new elements and adds a new variable entry, ts0, corresponding to the ts0 variable in the sched_waking trigger above.

sched_waking histogram ———————-:

+------------------+
| hist_data        |<-------------------------------------------------------+
+------------------+   +-------------------+                                |
  | .fields[]      |-->| val = hitcount    |                                |
  +----------------+   +-------------------+                                |
  | .map           |     | .size           |                                |
  +----------------+     +-----------------+                                |
                         | .offset         |                                |
                         +-----------------+                                |
                         | .fn()           |                                |
                         +-----------------+                                |
                         | .flags          |                                |
                         +-----------------+                                |
                         | .var.idx        |                                |
                       +-------------------+                                |
                       | var = ts0         |                                |
                       +-------------------+                                |
                         | .size           |                                |
                         +-----------------+                                |
                         | .offset         |                                |
                         +-----------------+                                |
                         | .fn()           |                                |
                         +-----------------+                                |
                         | .flags & FL_VAR |                                |
                         +-----------------+                                |
                         | .var.idx        |----------------------------+-+ |
                         +-----------------+                            | | |
                                  .                                     | | |
                                  .                                     | | |
                                  .                                     | | |
                       +-------------------+ <--- n_vals                | | |
                       | key = pid         |                            | | |
                       +-------------------+                            | | |
                         | .size           |                            | | |
                         +-----------------+                            | | |
                         | .offset         |                            | | |
                         +-----------------+                            | | |
                         | .fn()           |                            | | |
                         +-----------------+                            | | |
                         | .flags & FL_KEY |                            | | |
                         +-----------------+                            | | |
                         | .var.idx        |                            | | |
                       +-------------------+ <--- n_fields              | | |
                       | unused            |                            | | |
                       +-------------------+                            | | |
                         |                 |                            | | |
                         +-----------------+                            | | |
                         |                 |                            | | |
                         +-----------------+                            | | |
                         |                 |                            | | |
                         +-----------------+                            | | |
                         |                 |                            | | |
                         +-----------------+                            | | |
                         |                 |                            | | |
                         +-----------------+                            | | |
                                           n_keys = n_fields - n_vals   | | |
                                                                        | | |

This is very similar to the basic case. In the above diagram, we can | | | see a new .flags member has been added to the struct hist_field | | | struct, and a new entry added to hist_data.fields representing the ts0 | | | variable. For a normal val hist_field, .flags is just 0 (modulo | | | modifier flags), but if the value is defined as a variable, the .flags | | | contains a set FL_VAR bit. | | |

As you can see, the ts0 entry’s .var.idx member contains the index | | | into the tracing_map_elts’ .vars[] array containing variable values. | | | This idx is used whenever the value of the variable is set or read. | | | The map_elt.vars idx assigned to the given variable is assigned and | | | saved in .var.idx by create_tracing_map_fields() after it calls | | | tracing_map_add_var(). | | |

Below is a representation of the histogram at run-time, which | | | populates the map, along with correspondence to the above hist_data and | | | hist_field data structures. | | |

The diagram attempts to show the relationship between the | | | hist_data.fields[] and the map_elt.fields[] and map_elt.vars[] with | | | the links drawn between diagrams. For each of the map_elts, you can | | | see that the .fields[] members point to the .sum or .offset of a key | | | or val and the .vars[] members point to the value of a variable. The | | | arrows between the two diagrams show the linkages between those | | | tracing_map members and the field definitions in the corresponding | | | hist_data fields[] members.:

+-----------+                                                           | | |
| hist_data |                                                           | | |
+-----------+                                                           | | |
  | .fields |                                                           | | |
  +---------+     +-----------+                                         | | |
  | .map    |---->| map_entry |                                         | | |
  +---------+     +-----------+                                         | | |
                    | .key    |---> 0                                   | | |
                    +---------+                                         | | |
                    | .val    |---> NULL                                | | |
                  +-----------+                                         | | |
                  | map_entry |                                         | | |
                  +-----------+                                         | | |
                    | .key    |---> pid = 999                           | | |
                    +---------+    +-----------+                        | | |
                    | .val    |--->| map_elt   |                        | | |
                    +---------+    +-----------+                        | | |
                         .           | .key    |---> full key *         | | |
                         .           +---------+    +---------------+   | | |
                         .           | .fields |--->| .sum (val)    |   | | |
                         .           +---------+    | 2345          |   | | |
                         .        +--| .vars   |    +---------------+   | | |
                         .        |  +---------+    | .offset (key) |   | | |
                         .        |                 | 0             |   | | |
                         .        |                 +---------------+   | | |
                         .        |                         .           | | |
                         .        |                         .           | | |
                         .        |                         .           | | |
                         .        |                 +---------------+   | | |
                         .        |                 | .sum (val) or |   | | |
                         .        |                 | .offset (key) |   | | |
                         .        |                 +---------------+   | | |
                         .        |                 | .sum (val) or |   | | |
                         .        |                 | .offset (key) |   | | |
                         .        |                 +---------------+   | | |
                         .        |                                     | | |
                         .        +---------------->+---------------+   | | |
                         .                          | ts0           |<--+ | |
                         .                          | 113345679876  |   | | |
                         .                          +---------------+   | | |
                         .                          | unused        |   | | |
                         .                          |               |   | | |
                         .                          +---------------+   | | |
                         .                                  .           | | |
                         .                                  .           | | |
                         .                                  .           | | |
                         .                          +---------------+   | | |
                         .                          | unused        |   | | |
                         .                          |               |   | | |
                         .                          +---------------+   | | |
                         .                          | unused        |   | | |
                         .                          |               |   | | |
                         .                          +---------------+   | | |
                         .                                              | | |
                  +-----------+                                         | | |
                  | map_entry |                                         | | |
                  +-----------+                                         | | |
                    | .key    |---> pid = 4444                          | | |
                    +---------+    +-----------+                        | | |
                    | .val    |--->| map_elt   |                        | | |
                    +---------+    +-----------+                        | | |
                         .           | .key    |---> full key *         | | |
                         .           +---------+    +---------------+   | | |
                         .           | .fields |--->| .sum (val)    |   | | |
                                     +---------+    | 2345          |   | | |
                                  +--| .vars   |    +---------------+   | | |
                                  |  +---------+    | .offset (key) |   | | |
                                  |                 | 0             |   | | |
                                  |                 +---------------+   | | |
                                  |                         .           | | |
                                  |                         .           | | |
                                  |                         .           | | |
                                  |                 +---------------+   | | |
                                  |                 | .sum (val) or |   | | |
                                  |                 | .offset (key) |   | | |
                                  |                 +---------------+   | | |
                                  |                 | .sum (val) or |   | | |
                                  |                 | .offset (key) |   | | |
                                  |                 +---------------+   | | |
                                  |                                     | | |
                                  |                 +---------------+   | | |
                                  +---------------->| ts0           |<--+ | |
                                                    | 213499240729  |     | |
                                                    +---------------+     | |
                                                    | unused        |     | |
                                                    |               |     | |
                                                    +---------------+     | |
                                                            .             | |
                                                            .             | |
                                                            .             | |
                                                    +---------------+     | |
                                                    | unused        |     | |
                                                    |               |     | |
                                                    +---------------+     | |
                                                    | unused        |     | |
                                                    |               |     | |
                                                    +---------------+     | |

For each used map entry, there’s a map_elt pointing to an array of | | .vars containing the current value of the variables associated with | | that histogram entry. So in the above, the timestamp associated with | | pid 999 is 113345679876, and the timestamp variable in the same | | .var.idx for pid 4444 is 213499240729. | |

sched_switch histogram | | ———————- | |

The sched_switch histogram paired with the above sched_waking | | histogram is shown below. The most important aspect of the | | sched_switch histogram is that it references a variable on the | | sched_waking histogram above. | |

The histogram diagram is very similar to the others so far displayed, | | but it adds variable references. You can see the normal hitcount and | | key fields along with a new wakeup_lat variable implemented in the | | same way as the sched_waking ts0 variable, but in addition there’s an | | entry with the new FL_VAR_REF (short for HIST_FIELD_FL_VAR_REF) flag. | |

Associated with the new var ref field are a couple of new hist_field | | members, var.hist_data and var_ref_idx. For a variable reference, the | | var.hist_data goes with the var.idx, which together uniquely identify | | a particular variable on a particular histogram. The var_ref_idx is | | just the index into the var_ref_vals[] array that caches the values of | | each variable whenever a hist trigger is updated. Those resulting | | values are then finally accessed by other code such as trace action | | code that uses the var_ref_idx values to assign param values. | |

The diagram below describes the situation for the sched_switch | | histogram referred to before:

 # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >>     | |
         events/sched/sched_switch/trigger                                 | |
                                                                           | |
 +------------------+                                                      | |
 | hist_data        |                                                      | |
 +------------------+   +-----------------------+                          | |
   | .fields[]      |-->| val = hitcount        |                          | |
   +----------------+   +-----------------------+                          | |
   | .map           |     | .size               |                          | |
   +----------------+     +---------------------+                          | |
+--| .var_refs[]    |     | .offset             |                          | |
|  +----------------+     +---------------------+                          | |
|                         | .fn()               |                          | |
|   var_ref_vals[]        +---------------------+                          | |
|  +-------------+        | .flags              |                          | |
|  | $ts0        |<---+   +---------------------+                          | |
|  +-------------+    |   | .var.idx            |                          | |
|  |             |    |   +---------------------+                          | |
|  +-------------+    |   | .var.hist_data      |                          | |
|  |             |    |   +---------------------+                          | |
|  +-------------+    |   | .var_ref_idx        |                          | |
|  |             |    | +-----------------------+                          | |
|  +-------------+    | | var = wakeup_lat      |                          | |
|         .           | +-----------------------+                          | |
|         .           |   | .size               |                          | |
|         .           |   +---------------------+                          | |
|  +-------------+    |   | .offset             |                          | |
|  |             |    |   +---------------------+                          | |
|  +-------------+    |   | .fn()               |                          | |
|  |             |    |   +---------------------+                          | |
|  +-------------+    |   | .flags & FL_VAR     |                          | |
|                     |   +---------------------+                          | |
|                     |   | .var.idx            |                          | |
|                     |   +---------------------+                          | |
|                     |   | .var.hist_data      |                          | |
|                     |   +---------------------+                          | |
|                     |   | .var_ref_idx        |                          | |
|                     |   +---------------------+                          | |
|                     |             .                                      | |
|                     |             .                                      | |
|                     |             .                                      | |
|                     | +-----------------------+ <--- n_vals              | |
|                     | | key = pid             |                          | |
|                     | +-----------------------+                          | |
|                     |   | .size               |                          | |
|                     |   +---------------------+                          | |
|                     |   | .offset             |                          | |
|                     |   +---------------------+                          | |
|                     |   | .fn()               |                          | |
|                     |   +---------------------+                          | |
|                     |   | .flags              |                          | |
|                     |   +---------------------+                          | |
|                     |   | .var.idx            |                          | |
|                     | +-----------------------+ <--- n_fields            | |
|                     | | unused                |                          | |
|                     | +-----------------------+                          | |
|                     |   |                     |                          | |
|                     |   +---------------------+                          | |
|                     |   |                     |                          | |
|                     |   +---------------------+                          | |
|                     |   |                     |                          | |
|                     |   +---------------------+                          | |
|                     |   |                     |                          | |
|                     |   +---------------------+                          | |
|                     |   |                     |                          | |
|                     |   +---------------------+                          | |
|                     |                         n_keys = n_fields - n_vals | |
|                     |                                                    | |
|                     |                                                    | |
|                     | +-----------------------+                          | |
+---------------------->| var_ref = $ts0        |                          | |
                      | +-----------------------+                          | |
                      |   | .size               |                          | |
                      |   +---------------------+                          | |
                      |   | .offset             |                          | |
                      |   +---------------------+                          | |
                      |   | .fn()               |                          | |
                      |   +---------------------+                          | |
                      |   | .flags & FL_VAR_REF |                          | |
                      |   +---------------------+                          | |
                      |   | .var.idx            |--------------------------+ |
                      |   +---------------------+                            |
                      |   | .var.hist_data      |----------------------------+
                      |   +---------------------+
                      +---| .var_ref_idx        |
                          +---------------------+

Abbreviations used in the diagrams:

hist_data = struct hist_trigger_data
hist_data.fields = struct hist_field
fn = hist_field_fn_t
FL_KEY = HIST_FIELD_FL_KEY
FL_VAR = HIST_FIELD_FL_VAR
FL_VAR_REF = HIST_FIELD_FL_VAR_REF

When a hist trigger makes use of a variable, a new hist_field is created with flag HIST_FIELD_FL_VAR_REF. For a VAR_REF field, the var.idx and var.hist_data take the same values as the referenced variable, as well as the referenced variable’s size, type, and is_signed values. The VAR_REF field’s .name is set to the name of the variable it references. If a variable reference was created using the explicit system.event.$var_ref notation, the hist_field’s system and event_name variables are also set.

So, in order to handle an event for the sched_switch histogram, because we have a reference to a variable on another histogram, we need to resolve all variable references first. This is done via the resolve_var_refs() calls made from event_hist_trigger(). What this does is grabs the var_refs[] array from the hist_data representing the sched_switch histogram. For each one of those, the referenced variable’s var.hist_data along with the current key is used to look up the corresponding tracing_map_elt in that histogram. Once found, the referenced variable’s var.idx is used to look up the variable’s value using tracing_map_read_var(elt, var.idx), which yields the value of the variable for that element, ts0 in the case above. Note that both the hist_fields representing both the variable and the variable reference have the same var.idx, so this is straightforward.

Variable and variable reference test

This example creates a variable on the sched_waking event, ts0, and uses it in the sched_switch trigger. The sched_switch trigger also creates its own variable, wakeup_lat, but nothing yet uses it:

# echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger

# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> events/sched/sched_switch/trigger

Looking at the sched_waking ‘hist_debug’ output, in addition to the normal key and value hist_fields, in the val fields section we see a field with the HIST_FIELD_FL_VAR flag, which indicates that that field represents a variable. Note that in addition to the variable name, contained in the var.name field, it includes the var.idx, which is the index into the tracing_map_elt.vars[] array of the actual variable location. Note also that the output shows that variables live in the same part of the hist_data->fields[] array as normal values:

# cat events/sched/sched_waking/hist_debug

# event histogram
#
# trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
#

hist_data: 000000009536f554

n_vals: 2
n_keys: 1
n_fields: 3

val fields:

  hist_data->fields[0]:
    flags:
      VAL: HIST_FIELD_FL_HITCOUNT
    type: u64
    size: 8
    is_signed: 0

  hist_data->fields[1]:
    flags:
      HIST_FIELD_FL_VAR
    var.name: ts0
    var.idx (into tracing_map_elt.vars[]): 0
    type: u64
    size: 8
    is_signed: 0

key fields:

  hist_data->fields[2]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: pid
    type: pid_t
    size: 8
    is_signed: 1

Moving on to the sched_switch trigger hist_debug output, in addition to the unused wakeup_lat variable, we see a new section displaying variable references. Variable references are displayed in a separate section because in addition to being logically separate from variables and values, they actually live in a separate hist_data array, var_refs[].

In this example, the sched_switch trigger has a reference to a variable on the sched_waking trigger, $ts0. Looking at the details, we can see that the var.hist_data value of the referenced variable matches the previously displayed sched_waking trigger, and the var.idx value matches the previously displayed var.idx value for that variable. Also displayed is the var_ref_idx value for that variable reference, which is where the value for that variable is cached for use when the trigger is invoked:

# cat events/sched/sched_switch/hist_debug

# event histogram
#
# trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global [active]
#

hist_data: 00000000f4ee8006

n_vals: 2
n_keys: 1
n_fields: 3

val fields:

  hist_data->fields[0]:
    flags:
      VAL: HIST_FIELD_FL_HITCOUNT
    type: u64
    size: 8
    is_signed: 0

  hist_data->fields[1]:
    flags:
      HIST_FIELD_FL_VAR
    var.name: wakeup_lat
    var.idx (into tracing_map_elt.vars[]): 0
    type: u64
    size: 0
    is_signed: 0

key fields:

  hist_data->fields[2]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: next_pid
    type: pid_t
    size: 8
    is_signed: 1

variable reference fields:

  hist_data->var_refs[0]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: ts0
    var.idx (into tracing_map_elt.vars[]): 0
    var.hist_data: 000000009536f554
    var_ref_idx (into hist_data->var_refs[]): 0
    type: u64
    size: 8
    is_signed: 0

The commands below can be used to clean things up for the next test:

# echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> events/sched/sched_switch/trigger

# echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger

Actions and Handlers

Adding onto the previous example, we will now do something with that wakeup_lat variable, namely send it and another field as a synthetic event.

The onmatch() action below basically says that whenever we have a sched_switch event, if we have a matching sched_waking event, in this case if we have a pid in the sched_waking histogram that matches the next_pid field on this sched_switch event, we retrieve the variables specified in the wakeup_latency() trace action, and use them to generate a new wakeup_latency event into the trace stream.

Note that the way the trace handlers such as wakeup_latency() (which could equivalently be written trace(wakeup_latency,$wakeup_lat,next_pid) are implemented, the parameters specified to the trace handler must be variables. In this case, $wakeup_lat is obviously a variable, but next_pid isn’t, since it’s just naming a field in the sched_switch trace event. Since this is something that almost every trace() and save() action does, a special shortcut is implemented to allow field names to be used directly in those cases. How it works is that under the covers, a temporary variable is created for the named field, and this variable is what is actually passed to the trace handler. In the code and documentation, this type of variable is called a ‘field variable’.

Fields on other trace event’s histograms can be used as well. In that case we have to generate a new histogram and an unfortunately named ‘synthetic_field’ (the use of synthetic here has nothing to do with synthetic events) and use that special histogram field as a variable.

The diagram below illustrates the new elements described above in the context of the sched_switch histogram using the onmatch() handler and the trace() action.

First, we define the wakeup_latency synthetic event:

# echo 'wakeup_latency u64 lat; pid_t pid' >> synthetic_events

Next, the sched_waking hist trigger as before:

# echo 'hist:keys=pid:ts0=common_timestamp.usecs' >>
        events/sched/sched_waking/trigger

Finally, we create a hist trigger on the sched_switch event that generates a wakeup_latency() trace event. In this case we pass next_pid into the wakeup_latency synthetic event invocation, which means it will be automatically converted into a field variable:

# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
        onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid)' >>
        /sys/kernel/debug/tracing/events/sched/sched_switch/trigger

The diagram for the sched_switch event is similar to previous examples but shows the additional field_vars[] array for hist_data and shows the linkages between the field_vars and the variables and references created to implement the field variables. The details are discussed below:

  +------------------+
  | hist_data        |
  +------------------+   +-----------------------+
    | .fields[]      |-->| val = hitcount        |
    +----------------+   +-----------------------+
    | .map           |     | .size               |
    +----------------+     +---------------------+
+---| .field_vars[]  |     | .offset             |
|   +----------------+     +---------------------+
|+--| .var_refs[]    |     | .offset             |
||  +----------------+     +---------------------+
||                         | .fn()               |
||   var_ref_vals[]        +---------------------+
||  +-------------+        | .flags              |
||  | $ts0        |<---+   +---------------------+
||  +-------------+    |   | .var.idx            |
||  | $next_pid   |<-+ |   +---------------------+
||  +-------------+  | |   | .var.hist_data      |
||+>| $wakeup_lat |  | |   +---------------------+
||| +-------------+  | |   | .var_ref_idx        |
||| |             |  | | +-----------------------+
||| +-------------+  | | | var = wakeup_lat      |
|||        .         | | +-----------------------+
|||        .         | |   | .size               |
|||        .         | |   +---------------------+
||| +-------------+  | |   | .offset             |
||| |             |  | |   +---------------------+
||| +-------------+  | |   | .fn()               |
||| |             |  | |   +---------------------+
||| +-------------+  | |   | .flags & FL_VAR     |
|||                  | |   +---------------------+
|||                  | |   | .var.idx            |
|||                  | |   +---------------------+
|||                  | |   | .var.hist_data      |
|||                  | |   +---------------------+
|||                  | |   | .var_ref_idx        |
|||                  | |   +---------------------+
|||                  | |              .
|||                  | |              .
|||                  | |              .
|||                  | |              .
||| +--------------+ | |              .
+-->| field_var    | | |              .
 || +--------------+ | |              .
 ||   | var        | | |              .
 ||   +------------+ | |              .
 ||   | val        | | |              .
 || +--------------+ | |              .
 || | field_var    | | |              .
 || +--------------+ | |              .
 ||   | var        | | |              .
 ||   +------------+ | |              .
 ||   | val        | | |              .
 ||   +------------+ | |              .
 ||         .        | |              .
 ||         .        | |              .
 ||         .        | | +-----------------------+ <--- n_vals
 || +--------------+ | | | key = pid             |
 || | field_var    | | | +-----------------------+
 || +--------------+ | |   | .size               |
 ||   | var        |--+|   +---------------------+
 ||   +------------+ |||   | .offset             |
 ||   | val        |-+||   +---------------------+
 ||   +------------+ |||   | .fn()               |
 ||                  |||   +---------------------+
 ||                  |||   | .flags              |
 ||                  |||   +---------------------+
 ||                  |||   | .var.idx            |
 ||                  |||   +---------------------+ <--- n_fields
 ||                  |||
 ||                  |||                           n_keys = n_fields - n_vals
 ||                  ||| +-----------------------+
 ||                  |+->| var = next_pid        |
 ||                  | | +-----------------------+
 ||                  | |   | .size               |
 ||                  | |   +---------------------+
 ||                  | |   | .offset             |
 ||                  | |   +---------------------+
 ||                  | |   | .flags & FL_VAR     |
 ||                  | |   +---------------------+
 ||                  | |   | .var.idx            |
 ||                  | |   +---------------------+
 ||                  | |   | .var.hist_data      |
 ||                  | | +-----------------------+
 ||                  +-->| val for next_pid      |
 ||                  | | +-----------------------+
 ||                  | |   | .size               |
 ||                  | |   +---------------------+
 ||                  | |   | .offset             |
 ||                  | |   +---------------------+
 ||                  | |   | .fn()               |
 ||                  | |   +---------------------+
 ||                  | |   | .flags              |
 ||                  | |   +---------------------+
 ||                  | |   |                     |
 ||                  | |   +---------------------+
 ||                  | |
 ||                  | |
 ||                  | | +-----------------------+
 +|------------------|-|>| var_ref = $ts0        |
  |                  | | +-----------------------+
  |                  | |   | .size               |
  |                  | |   +---------------------+
  |                  | |   | .offset             |
  |                  | |   +---------------------+
  |                  | |   | .fn()               |
  |                  | |   +---------------------+
  |                  | |   | .flags & FL_VAR_REF |
  |                  | |   +---------------------+
  |                  | +---| .var_ref_idx        |
  |                  |   +-----------------------+
  |                  |   | var_ref = $next_pid   |
  |                  |   +-----------------------+
  |                  |     | .size               |
  |                  |     +---------------------+
  |                  |     | .offset             |
  |                  |     +---------------------+
  |                  |     | .fn()               |
  |                  |     +---------------------+
  |                  |     | .flags & FL_VAR_REF |
  |                  |     +---------------------+
  |                  +-----| .var_ref_idx        |
  |                      +-----------------------+
  |                      | var_ref = $wakeup_lat |
  |                      +-----------------------+
  |                        | .size               |
  |                        +---------------------+
  |                        | .offset             |
  |                        +---------------------+
  |                        | .fn()               |
  |                        +---------------------+
  |                        | .flags & FL_VAR_REF |
  |                        +---------------------+
  +------------------------| .var_ref_idx        |
                           +---------------------+

As you can see, for a field variable, two hist_fields are created: one representing the variable, in this case next_pid, and one to actually get the value of the field from the trace stream, like a normal val field does. These are created separately from normal variable creation and are saved in the hist_data->field_vars[] array. See below for how these are used. In addition, a reference hist_field is also created, which is needed to reference the field variables such as $next_pid variable in the trace() action.

Note that $wakeup_lat is also a variable reference, referencing the value of the expression common_timestamp-$ts0, and so also needs to have a hist field entry representing that reference created.

When hist_trigger_elt_update() is called to get the normal key and value fields, it also calls update_field_vars(), which goes through each field_var created for the histogram, and available from hist_data->field_vars and calls val->fn() to get the data from the current trace record, and then uses the var’s var.idx to set the variable at the var.idx offset in the appropriate tracing_map_elt’s variable at elt->vars[var.idx].

Once all the variables have been updated, resolve_var_refs() can be called from event_hist_trigger(), and not only can our $ts0 and $next_pid references be resolved but the $wakeup_lat reference as well. At this point, the trace() action can simply access the values assembled in the var_ref_vals[] array and generate the trace event.

The same process occurs for the field variables associated with the save() action.

Abbreviations used in the diagram:

hist_data = struct hist_trigger_data
hist_data.fields = struct hist_field
field_var = struct field_var
fn = hist_field_fn_t
FL_KEY = HIST_FIELD_FL_KEY
FL_VAR = HIST_FIELD_FL_VAR
FL_VAR_REF = HIST_FIELD_FL_VAR_REF

trace() action field variable test

This example adds to the previous test example by finally making use of the wakeup_lat variable, but in addition also creates a couple of field variables that then are all passed to the wakeup_latency() trace action via the onmatch() handler.

First, we create the wakeup_latency synthetic event:

# echo 'wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events

Next, the sched_waking trigger from previous examples:

# echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger

Finally, as in the previous test example, we calculate and assign the wakeup latency using the $ts0 reference from the sched_waking trigger to the wakeup_lat variable, and finally use it along with a couple sched_switch event fields, next_pid and next_comm, to generate a wakeup_latency trace event. The next_pid and next_comm event fields are automatically converted into field variables for this purpose:

# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_comm)' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger

The sched_waking hist_debug output shows the same data as in the previous test example:

# cat events/sched/sched_waking/hist_debug

# event histogram
#
# trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
#

hist_data: 00000000d60ff61f

n_vals: 2
n_keys: 1
n_fields: 3

val fields:

  hist_data->fields[0]:
    flags:
      VAL: HIST_FIELD_FL_HITCOUNT
    type: u64
    size: 8
    is_signed: 0

  hist_data->fields[1]:
    flags:
      HIST_FIELD_FL_VAR
    var.name: ts0
    var.idx (into tracing_map_elt.vars[]): 0
    type: u64
    size: 8
    is_signed: 0

key fields:

  hist_data->fields[2]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: pid
    type: pid_t
    size: 8
    is_signed: 1

The sched_switch hist_debug output shows the same key and value fields as in the previous test example - note that wakeup_lat is still in the val fields section, but that the new field variables are not there - although the field variables are variables, they’re held separately in the hist_data’s field_vars[] array. Although the field variables and the normal variables are located in separate places, you can see that the actual variable locations for those variables in the tracing_map_elt.vars[] do have increasing indices as expected: wakeup_lat takes the var.idx = 0 slot, while the field variables for next_pid and next_comm have values var.idx = 1, and var.idx = 2. Note also that those are the same values displayed for the variable references corresponding to those variables in the variable reference fields section. Since there are two triggers and thus two hist_data addresses, those addresses also need to be accounted for when doing the matching - you can see that the first variable refers to the 0 var.idx on the previous hist trigger (see the hist_data address associated with that trigger), while the second variable refers to the 0 var.idx on the sched_switch hist trigger, as do all the remaining variable references.

Finally, the action tracking variables section just shows the system and event name for the onmatch() handler:

# cat events/sched/sched_switch/hist_debug

# event histogram
#
# trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_comm) [active]
#

hist_data: 0000000008f551b7

n_vals: 2
n_keys: 1
n_fields: 3

val fields:

  hist_data->fields[0]:
    flags:
      VAL: HIST_FIELD_FL_HITCOUNT
    type: u64
    size: 8
    is_signed: 0

  hist_data->fields[1]:
    flags:
      HIST_FIELD_FL_VAR
    var.name: wakeup_lat
    var.idx (into tracing_map_elt.vars[]): 0
    type: u64
    size: 0
    is_signed: 0

key fields:

  hist_data->fields[2]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: next_pid
    type: pid_t
    size: 8
    is_signed: 1

variable reference fields:

  hist_data->var_refs[0]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: ts0
    var.idx (into tracing_map_elt.vars[]): 0
    var.hist_data: 00000000d60ff61f
    var_ref_idx (into hist_data->var_refs[]): 0
    type: u64
    size: 8
    is_signed: 0

  hist_data->var_refs[1]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: wakeup_lat
    var.idx (into tracing_map_elt.vars[]): 0
    var.hist_data: 0000000008f551b7
    var_ref_idx (into hist_data->var_refs[]): 1
    type: u64
    size: 0
    is_signed: 0

  hist_data->var_refs[2]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: next_pid
    var.idx (into tracing_map_elt.vars[]): 1
    var.hist_data: 0000000008f551b7
    var_ref_idx (into hist_data->var_refs[]): 2
    type: pid_t
    size: 4
    is_signed: 0

  hist_data->var_refs[3]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: next_comm
    var.idx (into tracing_map_elt.vars[]): 2
    var.hist_data: 0000000008f551b7
    var_ref_idx (into hist_data->var_refs[]): 3
    type: char[16]
    size: 256
    is_signed: 0

field variables:

  hist_data->field_vars[0]:

    field_vars[0].var:
    flags:
      HIST_FIELD_FL_VAR
    var.name: next_pid
    var.idx (into tracing_map_elt.vars[]): 1

    field_vars[0].val:
    ftrace_event_field name: next_pid
    type: pid_t
    size: 4
    is_signed: 1

  hist_data->field_vars[1]:

    field_vars[1].var:
    flags:
      HIST_FIELD_FL_VAR
    var.name: next_comm
    var.idx (into tracing_map_elt.vars[]): 2

    field_vars[1].val:
    ftrace_event_field name: next_comm
    type: char[16]
    size: 256
    is_signed: 0

action tracking variables (for onmax()/onchange()/onmatch()):

  hist_data->actions[0].match_data.event_system: sched
  hist_data->actions[0].match_data.event: sched_waking

The commands below can be used to clean things up for the next test:

# echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_comm)' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger

# echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger

# echo '!wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events

action_data and the trace() action

As mentioned above, when the trace() action generates a synthetic event, all the parameters to the synthetic event either already are variables or are converted into variables (via field variables), and finally all those variable values are collected via references to them into a var_ref_vals[] array.

The values in the var_ref_vals[] array, however, don’t necessarily follow the same ordering as the synthetic event params. To address that, struct action_data contains another array, var_ref_idx[] that maps the trace action params to the var_ref_vals[] values. Below is a diagram illustrating that for the wakeup_latency() synthetic event:

+------------------+     wakeup_latency()
| action_data      |       event params               var_ref_vals[]
+------------------+    +-----------------+        +-----------------+
  | .var_ref_idx[] |--->| $wakeup_lat idx |---+    |                 |
  +----------------+    +-----------------+   |    +-----------------+
  | .synth_event   |    | $next_pid idx   |---|-+  | $wakeup_lat val |
  +----------------+    +-----------------+   | |  +-----------------+
                                 .            | +->| $next_pid val   |
                                 .            |    +-----------------+
                                 .            |           .
                        +-----------------+   |           .
                        |                 |   |           .
                        +-----------------+   |    +-----------------+
                                              +--->| $wakeup_lat val |
                                                   +-----------------+

Basically, how this ends up getting used in the synthetic event probe function, trace_event_raw_event_synth(), is as follows:

for each field i in .synth_event
  val_idx = .var_ref_idx[i]
  val = var_ref_vals[val_idx]

action_data and the onXXX() handlers

The hist trigger onXXX() actions other than onmatch(), such as onmax() and onchange(), also make use of and internally create hidden variables. This information is contained in the action_data.track_data struct, and is also visible in the hist_debug output as will be described in the example below.

Typically, the onmax() or onchange() handlers are used in conjunction with the save() and snapshot() actions. For example:

# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
        onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)' >>
        /sys/kernel/debug/tracing/events/sched/sched_switch/trigger

or:

# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
        onmax($wakeup_lat).snapshot()' >>
        /sys/kernel/debug/tracing/events/sched/sched_switch/trigger

save() action field variable test

For this example, instead of generating a synthetic event, the save() action is used to save field values whenever an onmax() handler detects that a new max latency has been hit. As in the previous example, the values being saved are also field values, but in this case, are kept in a separate hist_data array named save_vars[].

As in previous test examples, we set up the sched_waking trigger:

# echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger

In this case, however, we set up the sched_switch trigger to save some sched_switch field values whenever we hit a new maximum latency. For both the onmax() handler and save() action, variables will be created, which we can use the hist_debug files to examine:

# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)' >> events/sched/sched_switch/trigger

The sched_waking hist_debug output shows the same data as in the previous test examples:

# cat events/sched/sched_waking/hist_debug

#
# trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
#

hist_data: 00000000e6290f48

n_vals: 2
n_keys: 1
n_fields: 3

val fields:

  hist_data->fields[0]:
    flags:
      VAL: HIST_FIELD_FL_HITCOUNT
    type: u64
    size: 8
    is_signed: 0

  hist_data->fields[1]:
    flags:
      HIST_FIELD_FL_VAR
    var.name: ts0
    var.idx (into tracing_map_elt.vars[]): 0
    type: u64
    size: 8
    is_signed: 0

key fields:

  hist_data->fields[2]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: pid
    type: pid_t
    size: 8
    is_signed: 1

The output of the sched_switch trigger shows the same val and key values as before, but also shows a couple new sections.

First, the action tracking variables section now shows the actions[].track_data information describing the special tracking variables and references used to track, in this case, the running maximum value. The actions[].track_data.var_ref member contains the reference to the variable being tracked, in this case the $wakeup_lat variable. In order to perform the onmax() handler function, there also needs to be a variable that tracks the current maximum by getting updated whenever a new maximum is hit. In this case, we can see that an auto-generated variable named ‘ __max’ has been created and is visible in the actions[].track_data.track_var variable.

Finally, in the new ‘save action variables’ section, we can see that the 4 params to the save() function have resulted in 4 field variables being created for the purposes of saving the values of the named fields when the max is hit. These variables are kept in a separate save_vars[] array off of hist_data, so are displayed in a separate section:

# cat events/sched/sched_switch/hist_debug

# event histogram
#
# trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm) [active]
#

hist_data: 0000000057bcd28d

n_vals: 2
n_keys: 1
n_fields: 3

val fields:

  hist_data->fields[0]:
    flags:
      VAL: HIST_FIELD_FL_HITCOUNT
    type: u64
    size: 8
    is_signed: 0

  hist_data->fields[1]:
    flags:
      HIST_FIELD_FL_VAR
    var.name: wakeup_lat
    var.idx (into tracing_map_elt.vars[]): 0
    type: u64
    size: 0
    is_signed: 0

key fields:

  hist_data->fields[2]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: next_pid
    type: pid_t
    size: 8
    is_signed: 1

variable reference fields:

  hist_data->var_refs[0]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: ts0
    var.idx (into tracing_map_elt.vars[]): 0
    var.hist_data: 00000000e6290f48
    var_ref_idx (into hist_data->var_refs[]): 0
    type: u64
    size: 8
    is_signed: 0

  hist_data->var_refs[1]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: wakeup_lat
    var.idx (into tracing_map_elt.vars[]): 0
    var.hist_data: 0000000057bcd28d
    var_ref_idx (into hist_data->var_refs[]): 1
    type: u64
    size: 0
    is_signed: 0

action tracking variables (for onmax()/onchange()/onmatch()):

  hist_data->actions[0].track_data.var_ref:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: wakeup_lat
    var.idx (into tracing_map_elt.vars[]): 0
    var.hist_data: 0000000057bcd28d
    var_ref_idx (into hist_data->var_refs[]): 1
    type: u64
    size: 0
    is_signed: 0

  hist_data->actions[0].track_data.track_var:
    flags:
      HIST_FIELD_FL_VAR
    var.name: __max
    var.idx (into tracing_map_elt.vars[]): 1
    type: u64
    size: 8
    is_signed: 0

save action variables (save() params):

  hist_data->save_vars[0]:

    save_vars[0].var:
    flags:
      HIST_FIELD_FL_VAR
    var.name: next_comm
    var.idx (into tracing_map_elt.vars[]): 2

    save_vars[0].val:
    ftrace_event_field name: next_comm
    type: char[16]
    size: 256
    is_signed: 0

  hist_data->save_vars[1]:

    save_vars[1].var:
    flags:
      HIST_FIELD_FL_VAR
    var.name: prev_pid
    var.idx (into tracing_map_elt.vars[]): 3

    save_vars[1].val:
    ftrace_event_field name: prev_pid
    type: pid_t
    size: 4
    is_signed: 1

  hist_data->save_vars[2]:

    save_vars[2].var:
    flags:
      HIST_FIELD_FL_VAR
    var.name: prev_prio
    var.idx (into tracing_map_elt.vars[]): 4

    save_vars[2].val:
    ftrace_event_field name: prev_prio
    type: int
    size: 4
    is_signed: 1

  hist_data->save_vars[3]:

    save_vars[3].var:
    flags:
      HIST_FIELD_FL_VAR
    var.name: prev_comm
    var.idx (into tracing_map_elt.vars[]): 5

    save_vars[3].val:
    ftrace_event_field name: prev_comm
    type: char[16]
    size: 256
    is_signed: 0

The commands below can be used to clean things up for the next test:

# echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)' >> events/sched/sched_switch/trigger

# echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger

A couple special cases

While the above covers the basics of the histogram internals, there are a couple of special cases that should be discussed, since they tend to create even more confusion. Those are field variables on other histograms, and aliases, both described below through example tests using the hist_debug files.

Test of field variables on other histograms

This example is similar to the previous examples, but in this case, the sched_switch trigger references a hist trigger field on another event, namely the sched_waking event. In order to accomplish this, a field variable is created for the other event, but since an existing histogram can’t be used, as existing histograms are immutable, a new histogram with a matching variable is created and used, and we’ll see that reflected in the hist_debug output shown below.

First, we create the wakeup_latency synthetic event. Note the addition of the prio field:

# echo 'wakeup_latency u64 lat; pid_t pid; int prio' >> synthetic_events

As in previous test examples, we set up the sched_waking trigger:

# echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger

Here we set up a hist trigger on sched_switch to send a wakeup_latency event using an onmatch handler naming the sched_waking event. Note that the third param being passed to the wakeup_latency() is prio, which is a field name that needs to have a field variable created for it. There isn’t however any prio field on the sched_switch event so it would seem that it wouldn’t be possible to create a field variable for it. The matching sched_waking event does have a prio field, so it should be possible to make use of it for this purpose. The problem with that is that it’s not currently possible to define a new variable on an existing histogram, so it’s not possible to add a new prio field variable to the existing sched_waking histogram. It is however possible to create an additional new ‘matching’ sched_waking histogram for the same event, meaning that it uses the same key and filters, and define the new prio field variable on that.

Here’s the sched_switch trigger:

# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio)' >> events/sched/sched_switch/trigger

And here’s the output of the hist_debug information for the sched_waking hist trigger. Note that there are two histograms displayed in the output: the first is the normal sched_waking histogram we’ve seen in the previous examples, and the second is the special histogram we created to provide the prio field variable.

Looking at the second histogram below, we see a variable with the name synthetic_prio. This is the field variable created for the prio field on that sched_waking histogram:

# cat events/sched/sched_waking/hist_debug

# event histogram
#
# trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
#

hist_data: 00000000349570e4

n_vals: 2
n_keys: 1
n_fields: 3

val fields:

  hist_data->fields[0]:
    flags:
      VAL: HIST_FIELD_FL_HITCOUNT
    type: u64
    size: 8
    is_signed: 0

  hist_data->fields[1]:
    flags:
      HIST_FIELD_FL_VAR
    var.name: ts0
    var.idx (into tracing_map_elt.vars[]): 0
    type: u64
    size: 8
    is_signed: 0

key fields:

  hist_data->fields[2]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: pid
    type: pid_t
    size: 8
    is_signed: 1


# event histogram
#
# trigger info: hist:keys=pid:vals=hitcount:synthetic_prio=prio:sort=hitcount:size=2048 [active]
#

hist_data: 000000006920cf38

n_vals: 2
n_keys: 1
n_fields: 3

val fields:

  hist_data->fields[0]:
    flags:
      VAL: HIST_FIELD_FL_HITCOUNT
    type: u64
    size: 8
    is_signed: 0

  hist_data->fields[1]:
    flags:
      HIST_FIELD_FL_VAR
    ftrace_event_field name: prio
    var.name: synthetic_prio
    var.idx (into tracing_map_elt.vars[]): 0
    type: int
    size: 4
    is_signed: 1

key fields:

  hist_data->fields[2]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: pid
    type: pid_t
    size: 8
    is_signed: 1

Looking at the sched_switch histogram below, we can see a reference to the synthetic_prio variable on sched_waking, and looking at the associated hist_data address we see that it is indeed associated with the new histogram. Note also that the other references are to a normal variable, wakeup_lat, and to a normal field variable, next_pid, the details of which are in the field variables section:

# cat events/sched/sched_switch/hist_debug

# event histogram
#
# trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio) [active]
#

hist_data: 00000000a73b67df

n_vals: 2
n_keys: 1
n_fields: 3

val fields:

  hist_data->fields[0]:
    flags:
      VAL: HIST_FIELD_FL_HITCOUNT
    type: u64
    size: 8
    is_signed: 0

  hist_data->fields[1]:
    flags:
      HIST_FIELD_FL_VAR
    var.name: wakeup_lat
    var.idx (into tracing_map_elt.vars[]): 0
    type: u64
    size: 0
    is_signed: 0

key fields:

  hist_data->fields[2]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: next_pid
    type: pid_t
    size: 8
    is_signed: 1

variable reference fields:

  hist_data->var_refs[0]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: ts0
    var.idx (into tracing_map_elt.vars[]): 0
    var.hist_data: 00000000349570e4
    var_ref_idx (into hist_data->var_refs[]): 0
    type: u64
    size: 8
    is_signed: 0

  hist_data->var_refs[1]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: wakeup_lat
    var.idx (into tracing_map_elt.vars[]): 0
    var.hist_data: 00000000a73b67df
    var_ref_idx (into hist_data->var_refs[]): 1
    type: u64
    size: 0
    is_signed: 0

  hist_data->var_refs[2]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: next_pid
    var.idx (into tracing_map_elt.vars[]): 1
    var.hist_data: 00000000a73b67df
    var_ref_idx (into hist_data->var_refs[]): 2
    type: pid_t
    size: 4
    is_signed: 0

  hist_data->var_refs[3]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: synthetic_prio
    var.idx (into tracing_map_elt.vars[]): 0
    var.hist_data: 000000006920cf38
    var_ref_idx (into hist_data->var_refs[]): 3
    type: int
    size: 4
    is_signed: 1

field variables:

  hist_data->field_vars[0]:

    field_vars[0].var:
    flags:
      HIST_FIELD_FL_VAR
    var.name: next_pid
    var.idx (into tracing_map_elt.vars[]): 1

    field_vars[0].val:
    ftrace_event_field name: next_pid
    type: pid_t
    size: 4
    is_signed: 1

action tracking variables (for onmax()/onchange()/onmatch()):

  hist_data->actions[0].match_data.event_system: sched
  hist_data->actions[0].match_data.event: sched_waking

The commands below can be used to clean things up for the next test:

# echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio)' >> events/sched/sched_switch/trigger

# echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger

# echo '!wakeup_latency u64 lat; pid_t pid; int prio' >> synthetic_events

Alias test

This example is very similar to previous examples, but demonstrates the alias flag.

First, we create the wakeup_latency synthetic event:

# echo 'wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events

Next, we create a sched_waking trigger similar to previous examples, but in this case we save the pid in the waking_pid variable:

# echo 'hist:keys=pid:waking_pid=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger

For the sched_switch trigger, instead of using $waking_pid directly in the wakeup_latency synthetic event invocation, we create an alias of $waking_pid named $woken_pid, and use that in the synthetic event invocation instead:

# echo 'hist:keys=next_pid:woken_pid=$waking_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,$woken_pid,next_comm)' >> events/sched/sched_switch/trigger

Looking at the sched_waking hist_debug output, in addition to the normal fields, we can see the waking_pid variable:

# cat events/sched/sched_waking/hist_debug

# event histogram
#
# trigger info: hist:keys=pid:vals=hitcount:waking_pid=pid,ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
#

hist_data: 00000000a250528c

n_vals: 3
n_keys: 1
n_fields: 4

val fields:

  hist_data->fields[0]:
    flags:
      VAL: HIST_FIELD_FL_HITCOUNT
    type: u64
    size: 8
    is_signed: 0

  hist_data->fields[1]:
    flags:
      HIST_FIELD_FL_VAR
    ftrace_event_field name: pid
    var.name: waking_pid
    var.idx (into tracing_map_elt.vars[]): 0
    type: pid_t
    size: 4
    is_signed: 1

  hist_data->fields[2]:
    flags:
      HIST_FIELD_FL_VAR
    var.name: ts0
    var.idx (into tracing_map_elt.vars[]): 1
    type: u64
    size: 8
    is_signed: 0

key fields:

  hist_data->fields[3]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: pid
    type: pid_t
    size: 8
    is_signed: 1

The sched_switch hist_debug output shows that a variable named woken_pid has been created but that it also has the HIST_FIELD_FL_ALIAS flag set. It also has the HIST_FIELD_FL_VAR flag set, which is why it appears in the val field section.

Despite that implementation detail, an alias variable is actually more like a variable reference; in fact it can be thought of as a reference to a reference. The implementation copies the var_ref->fn() from the variable reference being referenced, in this case, the waking_pid fn(), which is hist_field_var_ref() and makes that the fn() of the alias. The hist_field_var_ref() fn() requires the var_ref_idx of the variable reference it’s using, so waking_pid’s var_ref_idx is also copied to the alias. The end result is that when the value of alias is retrieved, in the end it just does the same thing the original reference would have done and retrieves the same value from the var_ref_vals[] array. You can verify this in the output by noting that the var_ref_idx of the alias, in this case woken_pid, is the same as the var_ref_idx of the reference, waking_pid, in the variable reference fields section.

Additionally, once it gets that value, since it is also a variable, it then saves that value into its var.idx. So the var.idx of the woken_pid alias is 0, which it fills with the value from var_ref_idx 0 when its fn() is called to update itself. You’ll also notice that there’s a woken_pid var_ref in the variable refs section. That is the reference to the woken_pid alias variable, and you can see that it retrieves the value from the same var.idx as the woken_pid alias, 0, and then in turn saves that value in its own var_ref_idx slot, 3, and the value at this position is finally what gets assigned to the $woken_pid slot in the trace event invocation:

# cat events/sched/sched_switch/hist_debug

# event histogram
#
# trigger info: hist:keys=next_pid:vals=hitcount:woken_pid=$waking_pid,wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,$woken_pid,next_comm) [active]
#

hist_data: 0000000055d65ed0

n_vals: 3
n_keys: 1
n_fields: 4

val fields:

  hist_data->fields[0]:
    flags:
      VAL: HIST_FIELD_FL_HITCOUNT
    type: u64
    size: 8
    is_signed: 0

  hist_data->fields[1]:
    flags:
      HIST_FIELD_FL_VAR
      HIST_FIELD_FL_ALIAS
    var.name: woken_pid
    var.idx (into tracing_map_elt.vars[]): 0
    var_ref_idx (into hist_data->var_refs[]): 0
    type: pid_t
    size: 4
    is_signed: 1

  hist_data->fields[2]:
    flags:
      HIST_FIELD_FL_VAR
    var.name: wakeup_lat
    var.idx (into tracing_map_elt.vars[]): 1
    type: u64
    size: 0
    is_signed: 0

key fields:

  hist_data->fields[3]:
    flags:
      HIST_FIELD_FL_KEY
    ftrace_event_field name: next_pid
    type: pid_t
    size: 8
    is_signed: 1

variable reference fields:

  hist_data->var_refs[0]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: waking_pid
    var.idx (into tracing_map_elt.vars[]): 0
    var.hist_data: 00000000a250528c
    var_ref_idx (into hist_data->var_refs[]): 0
    type: pid_t
    size: 4
    is_signed: 1

  hist_data->var_refs[1]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: ts0
    var.idx (into tracing_map_elt.vars[]): 1
    var.hist_data: 00000000a250528c
    var_ref_idx (into hist_data->var_refs[]): 1
    type: u64
    size: 8
    is_signed: 0

  hist_data->var_refs[2]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: wakeup_lat
    var.idx (into tracing_map_elt.vars[]): 1
    var.hist_data: 0000000055d65ed0
    var_ref_idx (into hist_data->var_refs[]): 2
    type: u64
    size: 0
    is_signed: 0

  hist_data->var_refs[3]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: woken_pid
    var.idx (into tracing_map_elt.vars[]): 0
    var.hist_data: 0000000055d65ed0
    var_ref_idx (into hist_data->var_refs[]): 3
    type: pid_t
    size: 4
    is_signed: 1

  hist_data->var_refs[4]:
    flags:
      HIST_FIELD_FL_VAR_REF
    name: next_comm
    var.idx (into tracing_map_elt.vars[]): 2
    var.hist_data: 0000000055d65ed0
    var_ref_idx (into hist_data->var_refs[]): 4
    type: char[16]
    size: 256
    is_signed: 0

field variables:

  hist_data->field_vars[0]:

    field_vars[0].var:
    flags:
      HIST_FIELD_FL_VAR
    var.name: next_comm
    var.idx (into tracing_map_elt.vars[]): 2

    field_vars[0].val:
    ftrace_event_field name: next_comm
    type: char[16]
    size: 256
    is_signed: 0

action tracking variables (for onmax()/onchange()/onmatch()):

  hist_data->actions[0].match_data.event_system: sched
  hist_data->actions[0].match_data.event: sched_waking

The commands below can be used to clean things up for the next test:

# echo '!hist:keys=next_pid:woken_pid=$waking_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,$woken_pid,next_comm)' >> events/sched/sched_switch/trigger

# echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger

# echo '!wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events