# This presentation is licensed under the
# Creative Commons Attribution-ShareAlike 4.0 International license:
# https://creativecommons.org/licenses/by-sa/4.0/
Building an efficient statsd library in Go
GopherconRu
17 Mar 2018
Andrey Smirnov
Principal Software Developer, Virtustream
me@smira.ru
http://github.com/smira/gopherconru2018
@smira
* Introducing statsd
Statsd is a simple protocol for delivering metrics from services to an aggregator (the statsd server).
E.g. `service.requests.http.count:1|c` increments the counter *service.requests.http.count* by 1.
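For illustration, a few more sample lines in the same text format (metric names are made up), showing the common counter (`|c`), timer (`|ms`) and gauge (`|g`) types:

	service.requests.http.count:1|c
	service.response.time:320|ms
	service.queue.depth:42|g
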
.image images/statsd1.png
* Protocol
Statsd uses a UDP-based text protocol. Multiple metric updates can be packed into a single UDP datagram and delivered to the statsd server.
The statsd server aggregates all the metric samples and flushes the aggregated metrics to a time-series database once per aggregation interval.
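For example, a single datagram delivered to the server could carry several samples separated by newlines (metric names are illustrative):

	service.requests.http.count:1|c
	service.errors.count:2|c
	service.response.time:320|ms
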
_Footnote_: statsd supports both UDP and TCP, and a [[https://github.com/statsite/statsite][binary protocol]] is available.
* Pros and cons
Pros:
- lightweight
- TSDB-agnostic
- low load on TSDB
Cons:
- fixed metric resolution
- hard to track a specific instance
- high packet rate (pps)
* Goal
Build a high-performance, low-GC-pressure library for sending statsd metrics from a Go application.
- minimal performance impact on the application
- zero allocations per metric sample
- high throughput
* First Version
* Connection per Call
Connect the socket and close the connection on every metric sample.
.code -numbers stage1/client.go /START OMIT/,/END OMIT/
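Stripped down to its essence (a simplified sketch using only `net` and `fmt` from the standard library, not the actual stage1 code above), every call does roughly this:

	// Incr dials a fresh UDP connection, writes one counter sample
	// and closes the socket again -- once per metric update.
	func Incr(addr, name string, value int64) error {
		conn, err := net.Dial("udp", addr)
		if err != nil {
			return err
		}
		defer conn.Close()
		_, err = fmt.Fprintf(conn, "%s:%d|c", name, value)
		return err
	}
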
* Run Benchmark
* NO. STOP. DON'T DO IT YET
* Add Unit (Functional) Tests
Make sure your code works as intended before trying to benchmark or optimize it.
.code -numbers stage1/client_test.go /START TESTINCR OMIT/,/END TESTINCR OMIT/
.code stage1/gotest.out
* Concurrent Test
The statsd client is supposed to be called concurrently from multiple goroutines, so correctness under concurrency should be verified.
.code -numbers stage1/client_test.go /START CONCURRENT TEST OMIT/,/END CONCURRENT TEST OMIT/
* Run with -race
The Go race detector recompiles the source with data-access instrumentation that catches data races at run time.
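It is enabled with the `-race` flag:

	$ go test -race .
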
.code stage1/gotestrace.out
Hoorah!
No wonder the test passes: the client doesn't have any shared state.
* Benchmark
.code -numbers stage1/client_test.go /START BENCHMARK OMIT/,/END BENCHMARK OMIT/
A (micro)benchmark runs the function `b.N` times and captures performance metrics; we repeat the whole run several times (`-count`) to get statistically significant results:
$ go test -bench . -benchmem -run nothing -count 5
.code stage1/gobench.txt
* Process Results
`benchstat` is a great tool to render results in a human-readable way:
$ benchstat gobench.txt
.code stage1/gobenchstat.txt
* Second Version
* Connection per Client
We don't need to connect the UDP socket on every call (connecting might involve a DNS lookup if a hostname is used).
Can we share a single socket across multiple goroutines? With UDP, yes.
.code -numbers stage2/client.go /START OMIT/,/END OMIT/
* Measure Improvement
$ go test -bench . -benchmem -run nothing -count 5 > new.txt
$ benchstat old.txt new.txt
.code stage2/gobenchstat.txt
* Third Version
* Why So Many Allocs?
The benchmark shows `328` `B/op` and `12` `allocs/op` for three lines of code similar to:
.code stage2/excerpt
Why so many allocations? Let's run memory profiling for one of the test cases:
.code stage2/profile.txt
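For reference, a memory profile for a single benchmark is typically captured with something along these lines (benchmark and binary names are illustrative):

	$ go test -bench BenchmarkIncr -benchmem -memprofile mem.prof -run nothing
	$ go tool pprof -alloc_space statsd.test mem.prof
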
* Memory Allocations
.image images/pprof.png
* Removing fmt.Printf()
Can we remove `fmt.Printf()`?
Yes, since the format is fixed:
.code -numbers stage3/client.go /START OMIT/,/END OMIT/
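The underlying technique is to append straight into a reused byte slice using the standard `strconv` package, roughly (an illustrative helper, not the actual stage3 code):

	// appendCounter appends "name:value|c" to buf without any
	// intermediate allocations and returns the extended slice.
	func appendCounter(buf []byte, name string, value int64) []byte {
		buf = append(buf, name...)
		buf = append(buf, ':')
		buf = strconv.AppendInt(buf, value, 10)
		return append(buf, "|c"...)
	}
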
* Measure Improvement
.code stage3/gobenchstat.txt
* Fourth Version
* Aggregating Writes
Our client sends a single metric value per UDP packet, while the statsd server accepts any number of metric samples (up to the UDP packet size), separated by `\n`.
Aggregating many samples into a single packet should improve performance for both the client and the server.
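The core of the idea, as a sketch with illustrative field names (the actual code is on the next slide):

	// Append the sample to the shared packet buffer; when the next
	// sample would no longer fit into the maximum datagram size,
	// send what has been buffered and start a new packet.
	if len(c.packet)+len(sample)+1 > c.maxPacketSize {
		c.conn.Write(c.packet)
		c.packet = c.packet[:0]
	}
	c.packet = append(c.packet, sample...)
	c.packet = append(c.packet, '\n')
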
.image images/buffer.png
* Aggregating Writes
.code -numbers stage4/client.go /START OMIT/,/END OMIT/
* Benchmark
.code stage4/gobenchstat.txt
Amazing!
Is it?
* Test (Before Benchmark!)
.code stage4/gotestcommands.out
If we add concurrency:
.code stage4/gotestconcurrent.out
Oops...
* Race Detector
.code stage4/gotestrace.out
* Where is the bug?
.code -numbers stage4/client.go /START OMIT/,/END OMIT/ HLrace
* Fifth Version
* Proper Locking
We need to protect access to the packet buffer with a mutex:
.code -numbers stage5/client.go /START OMIT/,/END OMIT/
This reduces concurrency (all statsd operations are serialized), and worse, socket operations happen under the same lock.
* Benchmark Correct Version
.code stage5/gobenchstat.txt
Still an amazing improvement!
_Note_: this benchmark sends UDP packets over `localhost`, so networking overhead is negligible. A real network might incur much higher latency.
Benchmark on the platform you actually run on (results differ, e.g. OS X vs. Linux).
* Further Improvements
- decouple networking operations and the application code path (see the sketch after this list)
- scale out network send operations
- reconnect the UDP socket periodically to follow DNS changes
- buffer pool
- periodic flush of incomplete packets
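For example, the first item is commonly done by handing finished packets to a dedicated sender goroutine over a channel, so a slow socket never blocks metric calls. A minimal sketch (hypothetical names, not the actual go-statsd code; uses only `net` from the standard library):

	type Client struct {
		conn net.Conn    // connected UDP socket
		send chan []byte // finished packets waiting to be sent
	}

	// sender runs in its own goroutine; application goroutines only
	// enqueue buffers and never touch the socket directly.
	func (c *Client) sender() {
		for buf := range c.send {
			c.conn.Write(buf) // errors ignored for brevity
		}
	}
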
* Architecture
.image images/arch.png
* github.com/smira/go-statsd
* Benchmark
.code go-statsd/gobenchstat.txt
Just 2x faster than our last (simple) version.
* Conclusion
- microbenchmarks ([[https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go][go test -bench]], [[https://godoc.org/golang.org/x/perf/cmd/benchstat][benchstat]])
- profiling ([[https://blog.golang.org/profiling-go-programs][go tool pprof]])
- race detector ([[https://blog.golang.org/race-detector][go test -race]])
- production library ([[https://github.com/smira/go-statsd][github.com/smira/go-statsd]])
Overall improvement (`1/400th` of the time, zero allocations):
.code stage5/gobenchstatoverall.txt