DTK (data toolkit) is a suite of tools for parsing, analyzing, and graphing logs and other datasets.
In short, DTK converts data into knowledge.
Note: In the examples in this document, a prefix of $
indicates that the line is a command at a bash
-like shell prompt, and a token like <count>
is a placeholder for some actual value named "count".
Are your logs just a huge pile of data you lug around because "you might need them some day"? Do you make decisions without the knowledge locked away in those files because it's cumbersome to get from an Apache log to user behavior statistics? Do you waste hours digging through logs to track down the root cause of an incident? You're in the right place. Have a seat.
The modules in DTK follow the Unix philosophy - they do one thing and do it well, they work together, and they operate on text streams.
DTK provides many tools called "modules"; much like Git, they are all accessible through the launcher dtk
- dtk help
, dtk filter
, dtk parse
, etc. Each module focuses on solving one kind of problem while accepting and producing data in a reusable, common format.
Because of this, a DTK workflow often involves a pipe chain consisting of both DTK modules and other programs. These can be simple or complex:
# flags and other arguments omitted for brevity:
$ zcat | dtk parse | dtk hist1
$ zcat | grep | cut | cut | dtk filter | dtk uc | dtk filter | dtk hist2_bycol
$ cat <(zcat | dtk parse | dtk filter) <(zcat | dtk parse | dtk filter) | dtk delta1
Some DTK pipelines work like map-reduce jobs, and, in fact, the DTK suite is also quite effective when run in true big data map-reduce environments such as Hadoop Streaming.
DTK prioritizes usefulness and efficiency over shiny interfaces and PowerPoint presentations. These aren't your boss' tools (unless he also likes this sort of thing, of course).
The main dtk
launcher, which is normally at /usr/bin/dtk
, is responsible for finding and invoking the individual modules, which are normally in /usr/libexec/dtk-modules
. To invoke a module, use dtk <module>
, much as you would with git
. If you would like to load modules from a different directory (such as during dtk
development or in a homedir-local install), set the DTK_MODPATH
environment variable to a colon-delimited list of module directories, much like the PATH
environment variable. An empty string as one of the parts of DTK_MODPATH
will be replaced with the default (/usr/libexec/dtk-modules
), making it easy to provide overly directories by using something like :~/dtk-modules
(which would expand to /usr/libexec/dtk-modules
and ~/dtk-modules
).
To get details on a specific module, use dtk help <module>
, which simply invokes dtk <module> --help
for you.
Some modules (like hist2
and plot
) work better when more colors are available. For best results, get a terminal with 256-color support, and then set your TERM
environment variable to xterm-256color
. If you are within screen, you may have to do this explicitly, like cat stuff | TERM=xterm-256color dtk hist2
.
Most of the tools produce or operate on newline-terminated records containing tab-delimited fields, often called TSV ("tab-separated values"). Unless otherwise specified, tools in DTK take input on STDIN and produce output on STDOUT.
Merely invoking dtk help
produces a list of all available modules.
To get full documentation on a specific module, use dtk help <module>
(like dtk help uc
) which just invokes dtk <module> --help
for you.
The filter
module is similar to perl -ne
- it runs its first argument as Perl code wrapped in some convenience logic. The given code is run for each record after filling @v
with each field; the final value of @v
after the code is run is used as the output record. Use next
to drop a record or last
to skip all remaining records.
For example, suppose you have a dataset containing pairs of numbers:
$ perl -e 'print int(rand()*100)."\t".int(rand()*100)."\n" for 1..100' > data
$ head -n 5 data
50 34
13 15
37 5
62 2
7 88
You could add a column which contains the sum of the first two:
$ cat data | dtk filter '$v[2] = $v[0] + $v[1]' | head -n 5
50 34 84
13 15 28
37 5 42
62 2 64
7 88 95
Or perhaps you only want records where the first field is less than 20 and the second field is more than 80:
$ cat data | dtk filter 'next unless $v[0] < 20 && $v[1] > 80'
7 88
10 86
8 82
2 82
12 87
0 91
14 95
Because DTK is commonly used to analyze log files, helper functions are also available for fast parsing of data often found in logs:
parse_uri
- Takes a URI (or part of one, like
/path/to/thing?a=1&b=2
orfile:///path/to/thing.doc
) and returns an empty list on failure or otherwise a hash containingschema
,auth
,host
,port
,path
,query
, andfragment
. parse_query
- Takes a query string (key/value pairs in the
application/x-www-form-urlencoded
format) and returns a hash containing the key/value pairs. Keys with no=value
part will receive the value1
, such thata&b=2
will return(a=>1, b=>2)
. parse_cookies
- Takes a cookie string (key/value pairs like
a=1; b=2; c=3
) and returns a hash containing the key/value pairs. decode_uri
- Takes a URI-encoded string (containing codes like
%2d
for=
) and returns the decoded string.
Here are some example URIs and the result of parse_uri
:
$ cat uris
http://www.google.com/path/to/page.cgi?a=1&b=2&c#jump
http://www.google.com/
/path/to/thing?a=1&b=2
file:///path/to/thing.doc
$ cat uris | dtk filter 'my %u = parse_uri($v[0]); @v = ($u{host}, $v[0])'
www.google.com http://www.google.com/path/to/page.cgi?a=1&b=2&c#jump
www.google.com http://www.google.com/
/path/to/thing?a=1&b=2
file:///path/to/thing.doc
$ cat uris | dtk filter 'my %u = parse_uri($v[0]); @v = ($u{query}, $v[0])'
a=1&b=2&c http://www.google.com/path/to/page.cgi?a=1&b=2&c#jump
http://www.google.com/
a=1&b=2 /path/to/thing?a=1&b=2
file:///path/to/thing.doc
Here is a filter which shows the query string parameters b
and c
when parameter a
has a truthy value:
$ cat uris | dtk filter 'my %u = parse_uri($v[0]); my %q = parse_query($u{query}); next unless $q{a}; @v = ($q{b}, $q{c}, $v[0])'
2 1 http://www.google.com/path/to/page.cgi?a=1&b=2&c#jump
2 /path/to/thing?a=1&b=2
The parse
module parses specific file formats and produces the fields of each record in DTK-friendly tab-delimited output. To get a list of known formats, use dtk parse --help
(or dtk help parse
). To see details on a specific format, use dtk parse --help <format>
(or dtk help parse <format>
).
Parse formats are executable (chmod +x
) Perl scripts discovered in the parse-formats
subdirectory of any path given in DTK_MODPATH
, which by default causes a search only in /usr/libexec/dtk-modules/parse-formats/
. The following parse formats are packaged with DTK:
apache_access
- For parsing the default Apache access logs in the built-in
common
orcombined
log formats.
To parse a format, pipe it to dtk parse <format>
to get all fields, or dtk parse <format> <field>,<field>,...
to get specific fields. For example:
$ cat access_log | dtk parse apache_access datetime,bytes
09/May/2012:16:00:00 -0400 -
09/May/2012:16:00:00 -0400 -
09/May/2012:16:00:00 -0400 43
09/May/2012:16:00:00 -0400 -
09/May/2012:15:59:59 -0400 -
09/May/2012:16:00:00 -0400 -
09/May/2012:15:59:59 -0400 931
09/May/2012:16:00:00 -0400 43
09/May/2012:15:59:59 -0400 179090
09/May/2012:16:00:00 -0400 515
Some fields have extra post-filters which can be applied by specifying the field as <field>:<filter>
- these are shown in the format details in parentheses after each field. For example, apache_access
's datetime
field can be converted to epoch time, bytes
can be forced to a number (so -
becomes 0
), and useragent
can be categorized into general classes for easy bucketing:
$ cat access_log | dtk parse apache_access datetime:epoch,bytes:numeric,useragent:class
1336593600 0 ie8
1336593600 0 ie8
1336593600 43 ie9
1336593600 0 ie8
1336593599 0 ie9
1336593600 0 ie9
1336593599 931 firefox
1336593600 43 ie9
1336593599 179090 ie7
1336593600 515 ie9
Each parse format is merely an executable (chmod +x
) Perl script which returns a data structure representing instructions on what fields are available, how to build a regular expression which extracts those fields, and any filters that can be applied to the values in those fields before returning them. These instructions are represented as an array reference containing either a string (to be skipped as a literal delimiter at that position in each record) or an array reference containing the field's name, regex, and optional filters. When loaded, parse
will determine from this data the minimum regular expression required to extract the requested fields.
Suppose your application creates logs like this:
[2000-01-01 12:00:42 GMT] INFO - "system" Application starting
[2000-01-01 12:03:04 GMT] Warn - "Bob Johnson" Wrong password
[2000-01-01 12:03:18 GMT] INFO - "Bob Johnson" User has connected
[2000-01-01 12:12:53 GMT] critical - "Bob Johnson" Home directory full
[2000-01-01 13:45:08 GMT] INFO - "system" Home directory cleanup script started
This log is messy, weird, and exactly what you might find in the real world. Let's parse it with DTK by making a new weirdlog
parse format in the parse-formats
directory of one of our module directories and set it to executable (chmod +x parse-formats/weirdlog
). Here's our first attempt:
use warnings;
use strict;
return [
'[',
[datetime => qr/[^\]]+/],
'] ',
[level => qr/\S+/ ],
' - "',
[user => qr/[^\"]+/],
'" ',
[message => qr/.+/ ],
];
This parse format gives us four fields and returns them verbatim from the log. We can see the fields by asking DTK about them:
$ dtk help parse weirdlog
Fields (and filters) for format 'weirdlog' are:
datetime
level
user
message
Then, we can pull the log apart into DTK-friendly tab-separated format:
$ cat weirdlog | dtk parse weirdlog
2000-01-01 12:00:42 GMT INFO system Application starting
2000-01-01 12:03:04 GMT Warn Bob Johnson Wrong password
2000-01-01 12:03:18 GMT INFO Bob Johnson User has connected
2000-01-01 12:12:53 GMT critical Bob Johnson Home directory full
2000-01-01 13:45:08 GMT INFO system Home directory cleanup script started
Now, let's update the parse format with some filters to help clean the data up a little:
use warnings;
use strict;
use Date::Parse; #provides str2time() for human-to-epoch date parsing
return [
'[',
[datetime => qr/[^\]]+/, {epoch=>\&str2time}],
'] ',
[level => qr/\S+/, {lowercase=>sub{ lc $_[0] }, uppercase=>sub{ uc $_[0] }}],
' - "',
[user => qr/[^\"]+/],
'" ',
[message => qr/.+/ ],
];
We can see our new filters in the field list:
dtk help parse weirdlog
Fields (and filters) for format 'weirdlog' are:
datetime (epoch)
level (uppercase, lowercase)
user
message
Let's use our new filters and also rearrange the fields for good measure:
$ cat weirdlog | dtk parse weirdlog user,level:lowercase,datetime:epoch
system info 946728042
Bob Johnson warn 946728184
Bob Johnson info 946728198
Bob Johnson critical 946728773
system info 946734308
Keeping timestamps in epoch time has significant advantages when used with the histogram tools like hist1
and hist2
.
This tool will draw simple 2d scatterplots. Its input should consist of two-column data containing x and y coordinates; optionally, a third column can be provided which indicates the color (1-7) to use for the point. If the viewport bounds are not specified, all data will be read before any output is produced so the viewport can be automatically adjusted to fit the whole dataset. If all viewport bounds are given, the scatterplot will be drawn as data is read, allowing, for example, an animated chart from a tail -f
of an access log. Extents can either be given together as view=<xl>..<xh>,<yl>..<yh>
or individually as xl=<position>, yh=<position>
, etc.
Here is a plot of sin(x)
and cos(x)
:
$ perl -e 'for ($x=-5;$x<5;$x+=.1) {
print join("\t", $x, sin($x), 1)."\n";
print join("\t", $x, cos($x), 2)."\n";
}' \
| dtk plot
This can be especially useful for things like watching the relationship between two values in the logs in realtime; we use this for things like graphing logs that contain, for example, request time vs bytes returned. Here's an example that will animate drawing a spiral in your terminal:
$ perl -e 'use Time::HiRes qw(sleep);
$| = 1;
for ($t=0; $t<10; $t+=.01) {
print join("\t", $t*cos($t), $t*sin($t)) . "\n";
sleep .01;
}' \
| dtk plot view=-10..10,-10..10 flush_size=1
(The $| = 1
line turns on autoflush so the script prints every line individually; flush_size=1
turns off input buffering so each point is rendered, which is only sane for reading slow streams.)
This module takes rows of tab-delimited data on STDIN and rules about how to index and sort the records as command-line arguments. Each row is broken into columns in @v
and converted to sortable keys by each rule. The rows are sorted by the keys given by the rules and flags, giving earlier rules and flags a higher precedence.
Each rule can be a number (to simply sort by that column, starting from index 0), a macro (of the form ":"; see below), or Perl code. Code rules are executed in the context of a do{} block and should produce a sort key by evaluating it as its last expression. Specifying only a number is like specifying code which simply returns that field; that is, 3
is the same as '$v[3]'
. For example, to sort by the third column (2
) and then sort rows with matching third-column values by their first column (0
), pipe TSV data to dtk sort 2 0
. Or, you could sort by the sum of the second (1
) and third (2
) column by piping data to dtk sort '$v[1]+$v[2]'
.
Flags set modes which apply to subsequent rules until another mode overrides the previous. They consist of:
- either
-n
or-s
to switch to a numeric (<=>
) or string (cmp
) comparison - either
-a
or-d
to switch to ascending or descending sort
These flags apply to all subsequent rules until they are overridden by a different flag of the same type. (The default is to sort numeric/ascending as if -an
had been specified.) For example, to sort by the fist column lexically ascending and then by the second column lexically descending, use dtk sort -s 0 -d 1
. To sort by the first column numerically ascending and then by the second column lexically descending, use dtk sort 0 -sd 1
.
DTK sort
uses the rules to precalculate keys for each row; this design choice makes it very fast at the cost of memory. Each rule produces a new key for each row; once all sort keys are calculated, the entire data structure is sorted based on the ordering and comparison rules.
Valid macros are:
time
- convert the given column to a unix timestamp (using Date::Parse) and force
-n
. domain
- the given column is split by dots, reversed, and sorted by
-s
(so "b.com" comes before "a.net"). mixed
- the given column is broken into alternating groups of digit/nondigit parts and sorted by
-n
/-s
respectively (so "a2z" comes before "a10a").
For example, to sort by the first column if it contains human-readable times, pipe data to dtk sort 0:time
.
Here is a sort which takes a three-column input (two numbers and a string) and sorts it by the string (descending) followed by the sum of the two numbers (ascending):
$ perl -e 'my @str=qw(a b c); print rand()."\t".rand()."\t".$str[rand()*@str]."\n" for 1..30' \
| dtk sort -sd '$v[2]' -na '$v[0] + $v[1]'
Here is a sort which produces tab-delimited ip/time pairs from the first ten lines of an apache access log, sorting by date and, where the dates match, by individual IP octets.
$ cat access_log | dtk parse apache_access ip,datetime | head -n 10 | dtk sort 1:time 0:mixed
Here is an example of sorting a list of domains using first the standard GNU sort
utility (much like dtk sort -s 0
) and then using the domain
macro:
$ cat hostlist | sort
a.example2.com
a.example.com
a.example.net
b.example2.com
b.example.com
b.example.net
c.example2.com
c.example2.net
$ cat hostlist | dtk sort 0:domain
a.example.com
b.example.com
a.example2.com
b.example2.com
c.example2.com
a.example.net
b.example.net
c.example2.net
This module expects records containing a single numeric field and calculates statistics from them:
$ perl -e 'print rand()**2, "\n" for 1..100000' | dtk stats
count: 100000
min: 2.15895829436864e-10
max: 0.999996576152458
sum: 33383.415546156
mean: 0.33383415546156
variance: 0.0890887085401628
stddev: 0.298477316625841
skew: 0.638667363455869
This module counts up the number of times it receives each whole record on STDIN. It behaves much like ... | sort | uniq -c | sort -n | head -n ...
, except that it is much faster (by building a hash table instead of sorting, but at the expense of memory) and supports easy ordering and limits. Its only arguments are the number of lines after which output should stop (like head -n ...
) and either asc
or desc
to choose the sorting order (like sort -n
or sort -nr
). By default, dtk uc
will sort the output by count ascending with no output limit. The output is one line of output for every distinct line of input, prefixed by the number of times that line was seen and a tab.
$ perl -e 'print int(rand()**2*10), "\n" for 1..100000' | head -n 10
5
4
1
7
2
5
0
5
4
3
$ perl -e 'print int(rand()**2*10), "\n" for 1..100000' | dtk uc desc 5
31624 0
12971 1
10230 2
8539 3
7400 4
Note: You can read more about histograms on Wikipedia.
These tools are part of the DTK histogram suite. They all take newline-terminated records containing tab-delimited fields on STDIN and use the following conventions.
Options on a specific input column are of the format v<n><p>=<v>
, where <n>
is the input column (counting from zero), <p>
is the property to set, and <v>
is the new value for the property. For example, to set the number of buckets for the second column to 10
, use v1bn=10
. The following options are available for any module in the histogram suite:
v<n>t
- type; 'n' for numeric (group by numeric ranges), 's' for string (group each value separately)
v<n>bn
- bucket number; the integer number of buckets to use
v<n>bs
- bucket size; the width of each bucket
v<n>bd
- bucket disable; set to 1 to disable bucketing; usually set automatically when needed
v<n>bo
- bucket order; 'value' or 'count' (default: 'value')
v<n>cl
- count low; the minimum count a bucket from this column must have to be included (default: no limit)
v<n>f
- format; takes sprintf or strftime formats (as specified by
v<n>ft
) v<n>ft
- format type; either
sprintf
orstrftime
; defaults tosprintf
withv<n>f=%.02f
;v<n>ft=strftime
with nov<n>f
defaults tov<n>f="%F %T"
Only one of bn
/bs
/bd
can be set for each input column and can only be set for numeric (v<n>t=n
) columns.
For example:
- break the first column into ten buckets:
v0bn=10
- buckets of width .2 for the second input column:
v1bs=.2
- first column is 20 formatted buckets, second column is string:
v0bn=20 v0f=%.04f v1t=s
This module produces single-axis histograms. The output shows the minimum and maximum value that went into each bucket (min..max), the number of values that went into the bucket, and the bar of the histogram. The bar is drawn with =
and capped with -
when the last character of the bar would have been at least half-filled.
For example, here's a single-axis histogram of the sum of rolling three six-sided dice:
$ perl -e 'print d6()+d6()+d6(),"\n" for 1..100000; sub d6{int rand(6)+1}' | dtk hist1 v0f=%d v0bs=1
3..3 487 ==-
4..4 1366 =======
5..5 2871 ==============-
6..6 4565 =======================
7..7 7047 ====================================
8..8 9766 ==================================================
9..9 11482 ==========================================================-
10..10 12427 ===============================================================-
11..11 12650 =================================================================
12..12 11722 ============================================================
13..13 9532 ================================================-
14..14 6831 ===================================
15..15 4624 =======================-
16..16 2746 ==============
17..17 1448 =======
18..18 436 ==
Here's an example which unzips a bunch of log files, pulls out the response sizes (converting -
to 0
via :numeric
), filters out the stray hits over 1kb (this server mostly does small queries), and then renders a single-axis histogram (dtk hist1
) with 40 buckets (v0bn=40
) and rendering the values as integers (v0f=%d
).
$ zcat access_log.[123456789].gz \
| dtk parse apache_access bytes:numeric \
| dtk filter 'next unless $v[0] < 1024' \
| dtk hist1 v0f=%d v0bn=40
0..24 341466 ====================================================================================================
26..50 521
51..75 1654
76..100 415
101..125 7214 ==
126..148 80721 =======================-
151..175 9126 ==-
177..193 2439 -
201..224 500
230..243 835
262..274 22
279..300 87
301..324 565
326..349 39
361..375 161
379..395 1015
.. 0
448..448 1
454..463 71
.. 0
523..526 41
530..547 70
559..566 69
581..581 33
612..624 13183 ===-
632..634 117
655..676 98
693..701 390
710..716 99
732..749 104
754..761 42
779..786 265
803..814 183
.. 0
.. 0
.. 0
905..905 2
.. 0
.. 0
1002..1002 4
Above, you can see the minimum and maximum values which landed in each bucket (like 454..463
or 1002..1002
), the number of values which fell into that bucket (like 71
or 4
), and a bar showing the relative magnitude of the number of values in that bucket as compared to the others. Buckets which had no values show no minimum or maximum value (merely ..
).
This module renders two-axis histograms. 2d buckets are formed by pairing each row bucket with each column bucket. All buckets are scaled relative to the size of the single largest bucket. Each bucket is drawn both with a color (black → white in 256-color mode, black → rainbow → white in 16-color mode) and a number (00
→ 99
) indicating the relative height of that bar.
Here is a two-axis histogram which compares the incoming user agents with the time of day for a single day of logs. In it, all buckets are relative to the same scale, so it can be seen that while other
user agents (curl, web crawlers, etc) send requests consistently, they do so at about 1/5 of what firefox
user agents send at peak. It can also be seen that users of this service seem to use it primarily between roughly 9:00am and 9:00pm, but not during the night.
$ zcat access_log.gz | dtk parse apache_access datetime:epoch,useragent:class | dtk hist2 v0ft=strftime v1t=s
Here is another example which shows the number of responses per four-hour block to non-other
user agents broken down by logarithmically-scaled response size:
$ zcat access_log.gz \
| dtk parse apache_access datetime:epoch,bytes:numeric,useragent:class \
| dtk filter 'next if $v[2] eq "other"; $v[1]&&=log($v[1])/log(10); @v=@v[0,1]' \
| dtk hist2 v0ft=strftime v0bs=14400 v1bs=1
This module is identical to hist2
except that buckets are scaled relative to the highest bucket in their respective column rather than to every bucket in the histogram.
Here is the first example from hist2
but using hist2_bycol
instead. In it, because each column (each user agent class) is scaled separately, we can more clearly see the bursts from the other
user agents around 3:00am, 10:00am, and (a smaller one around) midnight.
Here is the second example from hist2
but again using hist2_bycol
instead. With each column scaled separately, we can more clearly see the shape of the distribution for each range of response sizes.
# zcat access_log.gz \
| dtk parse apache_access datetime:epoch,bytes:numeric,useragent:class \
| dtk filter 'next if $v[2] eq "other"; $v[1]&&=log($v[1])/log(10); @v=@v[0,1]' \
| dtk hist2_bycol v0ft=strftime v0bs=14400 v1bs=1
This module is like hist1
except that it takes a second column indicating the weight of the entry (including negative or non-integer weights). The histogram produced will have a zero line down the middle and will show, for each bucket, the relative total positive or negative value.
This example uses similar data to the first example for hist2
and hist2_bycol
; it runs the data through a dtk filter
command which gives requests having an other
user agent a weight of -1
(the control group) and all other requests a weight of 1
(the test group). With this diagram, we pretend that the other
data constitutes some baseline (in this case, "non-human interaction") and we compare it to the rest of the traffic (in this case, "human interaction"), which effectively highlights for us when human users are more likely (green) or less likely (red) to be interacting with the service than an automated user agent:
$ zcat access_log.gz \
| dtk parse apache_access datetime:epoch,useragent:class \
| dtk filter '$v[1] = $v[1] eq "other" ? -1 : 1' \
| dtk delta1 v0ft=strftime v0bs=14400 v1f=%d