Improve StreamBuf append #35928
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Interestingly, the benchmark results across all the tests seem to vary. Main:
This PR:
Tested on another system and got the same results. Also did some basic instrumenting. I suppose this comes down to what use cases we hit most often in filebeat. Going to ask around.
To help compare:
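For reproducing this kind of side-by-side comparison locally, one option (an assumption on my part, not necessarily what was used in this thread) is `benchstat` from `golang.org/x/perf`, run against multiple samples of the benchmark before and after the change:

```shell
# Sketch: collecting and comparing benchmark runs with benchstat.
# Assumes a Go toolchain, a beats repo checkout, and benchstat installed
# (go install golang.org/x/perf/cmd/benchstat@latest).
go test -benchmem -bench=BenchmarkEncoderReader -count=10 \
    ./libbeat/reader/readfile > old.txt
# ...check out the PR branch (placeholder name)...
go test -benchmem -bench=BenchmarkEncoderReader -count=10 \
    ./libbeat/reader/readfile > new.txt
benchstat old.txt new.txt
```

Running with `-count=10` gives benchstat enough samples to report statistical significance rather than a single noisy data point.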
@andrewkroh's comparisons kinda confirm what I was seeing just squinting at the different benchmarks, and that the delta in long-lines performance is pretty large compared to the sporadic slower results.
This is my attempt to explain what is happening and why bytes.Buffer is faster.
I think the reason for this is that the slice here was starting with 0 allocated bytes, so every `append` past capacity had to go through the runtime's `growslice`. Here is the relevant part of its growth calculation:

```go
newcap := old.cap
doublecap := newcap + newcap
if cap > doublecap {
	newcap = cap
} else {
	// ...
}
```

To contrast, here is the implementation of `bytes.NewBuffer`:

```go
func NewBuffer(buf []byte) *Buffer { return &Buffer{buf: buf} }
```

The `Write` implementation is where the allocation happens, and is essentially just a more optimized allocation paired with a call to `copy`:

```go
func (b *Buffer) Write(p []byte) (n int, err error) {
	b.lastRead = opInvalid
	m, ok := b.tryGrowByReslice(len(p))
	if !ok {
		m = b.grow(len(p))
	}
	return copy(b.buf[m:], p), nil
}
```

For small buffers there is a fast path in `grow`:

```go
// smallBufferSize is an initial allocation minimal capacity.
const smallBufferSize = 64
//...
if b.buf == nil && n <= smallBufferSize {
	b.buf = make([]byte, n, smallBufferSize)
	return 0
}
```

This trades space for speed by over-allocating, assuming we'll either grow or rewrite the buffer enough times that it will be worth it. This skips all of the allocations from 2-64 bytes by just doing a single allocation. We eventually end up in `growSlice`:

```go
// TODO(http://golang.org/issue/51462): We should rely on the append-make
// pattern so that the compiler can call runtime.growslice. For example:
//	return append(b, make([]byte, n)...)
// This avoids unnecessary zero-ing of the first len(b) bytes of the
// allocated slice, but this pattern causes b to escape onto the heap.
//
// Instead use the append-make pattern with a nil slice to ensure that
// we allocate buffers rounded up to the closest size class.
c := len(b) + n // ensure enough space for n elements
if c < 2*cap(b) {
	// The growth rate has historically always been 2x. In the future,
	// we could rely purely on append to determine the growth rate.
	c = 2 * cap(b)
}
b2 := append([]byte(nil), make([]byte, c)...)
copy(b2, b)
return b2[:len(b)]
```

The net result of all this is far fewer reallocations. I almost wonder if the initial 64 byte allocation is doing most of the work for us here, but there are a lot of places in `bytes.Buffer` that try really hard not to allocate if it doesn't have to, and when it does allocate it tries to allocate in an optimal way.
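To make the effect concrete, here is a small standalone sketch (mine, not from the PR) that counts how many times each approach reallocates its backing array while buffering 1024 bytes one at a time. Exact counts are Go implementation details, so the code observes them rather than hard-coding them:

```go
package main

import (
	"bytes"
	"fmt"
)

// countAppendGrowths appends n single bytes to a nil slice with the
// built-in append and counts how many times the capacity changed,
// i.e. how many times the runtime had to reallocate and copy.
func countAppendGrowths(n int) int {
	var s []byte
	growths := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, 'x')
		if cap(s) != before {
			growths++
		}
	}
	return growths
}

// countBufferGrowths pushes the same workload through bytes.Buffer,
// whose very first allocation is already 64 bytes (smallBufferSize),
// skipping the smallest growth steps entirely.
func countBufferGrowths(n int) int {
	var buf bytes.Buffer
	growths := 0
	for i := 0; i < n; i++ {
		before := buf.Cap()
		buf.WriteByte('x')
		if buf.Cap() != before {
			growths++
		}
	}
	return growths
}

func main() {
	fmt.Println("append reallocations:", countAppendGrowths(1024))
	fmt.Println("bytes.Buffer reallocations:", countBufferGrowths(1024))
}
```

On the Go versions I'm aware of, the `bytes.Buffer` path reallocates noticeably fewer times, since the 64-byte fast path and size-class rounding skip the smallest capacity steps.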
Great analysis, @cmacknz. It appeared to me like the allocation pattern was the difference here too. There appears to be a lot of opportunity for optimization in this code.
Yah, never touched this chunk of code, but I'm not surprised there's room for optimization here. Unless we want to prioritize this right now, I feel like merging this so we at least get some performance delta is the way to go?
I agree with @fearful-symmetry, let's merge it and follow up in another PR afterwards. @fearful-symmetry could you please review/approve this PR then?
Yes, this appears to be safe and the microbenchmarks show a very obvious improvement. Let's get it merged.
I am still curious about exactly which piece of the bytes.Buffer implementation is helping us the most here, but I can experiment with that later. From my previous comment there are several paths that could be helping, so this isn't a fluke improvement; there is concrete evidence in the code itself that this approach is better (even if it isn't obvious without looking much deeper).
@jeniawhite please add a changelog entry to https://github.com/elastic/beats/blob/main/CHANGELOG.next.asciidoc before merging this.
This pull request is now in conflicts. Could you fix it? 🙏
Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
Do we want to deal with the linter errors while we're here?
* Improve streambuf append

* Adding changelog comment

* Update CHANGELOG.next.asciidoc

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>

---------

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
What does this PR do?
Looking at the code, I saw that the streambuf has a method called `doAppend`. This method gets bytes and appends them to the bytes that the streambuf instance holds.
I noticed that the appending of the bytes was done using the `append` built-in function.
In order to gain performance I wanted to replace the whole byte slice that is being held by streambuf with a `bytes.Buffer` implementation, but since we manipulate the cursor I decided to skip that and changed only the append logic to use `bytes.Buffer`.
Running the pre-existing benchmarks I was able to see an improvement in performance for long lines (from my understanding this is a common use case for clients of `filebeat`).
This is the old result:
```
Evgbs-MacBook-Pro:readfile evgb$ go test -benchmem -bench=.
goos: darwin
goarch: arm64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
BenchmarkEncoderReader/long_lines-10        1230    878272 ns/op    1000100 processed_bytes    100.0 processed_lines    7849648 B/op    1289 allocs/op
PASS
ok      github.com/elastic/beats/v7/libbeat/reader/readfile    1.876s
```
This is the new result with the changes:
```
Evgbs-MacBook-Pro:readfile evgb$ go test -benchmem -bench=.
goos: darwin
goarch: arm64
pkg: github.com/elastic/beats/v7/libbeat/reader/readfile
BenchmarkEncoderReader/long_lines-10        1675    680225 ns/op    1000100 processed_bytes    100.0 processed_lines    5095600 B/op    770 allocs/op
PASS
ok      github.com/elastic/beats/v7/libbeat/reader/readfile    1.865s
```
Looking at the old profiling:
Looking at the new profiling:
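The shape of the change can be sketched roughly as follows. This is a hypothetical, simplified `streambuf` (the real libbeat type also tracks a read cursor and error state), contrasting the old plain-`append` path with routing the write through `bytes.Buffer` as this PR does:

```go
package main

import (
	"bytes"
	"fmt"
)

// streambuf is a stripped-down stand-in for libbeat's streambuf type;
// only the buffered bytes are modeled here.
type streambuf struct {
	data []byte
}

// appendOld is the pre-PR style: the built-in append manages growth.
func (b *streambuf) appendOld(p []byte) {
	b.data = append(b.data, p...)
}

// appendNew mirrors the idea in the PR: wrap the existing slice in a
// bytes.Buffer, write through it to benefit from its allocation
// strategy, and keep the resulting slice.
func (b *streambuf) appendNew(p []byte) {
	buf := bytes.NewBuffer(b.data)
	buf.Write(p)
	b.data = buf.Bytes()
}

func main() {
	var b streambuf
	b.appendNew([]byte("hello "))
	b.appendNew([]byte("world"))
	fmt.Println(string(b.data)) // hello world
}
```

Note that `bytes.NewBuffer` itself allocates nothing; the gain comes from `Write`, whose growth path over-allocates small buffers and rounds capacities up to size classes.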
Why is it important?
This is a component at the heart of `libbeat` that manages buffers and is utilized by multiple products.
Usually, this component is utilized in the hottest paths (every line in `filebeat` is processed by this code using the `linereader`).
Checklist

- `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc` entry added.