Strings manifest themselves in Go compiled binaries in a number of different ways. Below are 2 deep dives into different examples. There are of course many other examples that could be worked through.
Take the following Go program:
package main
const x = "banana"
func main() {
print(x)
}
Compile it and then disassemble the main function:
➜ test $ go build
➜ test $ objdump -macho -disassemble -dis-symname='_main.main' -x86-asm-syntax=intel test
test:
(__TEXT,__text) section
_main.main:
1056e50: 65 48 8b 0c 25 30 00 00 00 mov rcx, qword ptr gs:[48]
1056e59: 48 3b 61 10 cmp rsp, qword ptr [rcx + 16]
1056e5d: 76 3b jbe 0x1056e9a
1056e5f: 48 83 ec 18 sub rsp, 24
1056e63: 48 89 6c 24 10 mov qword ptr [rsp + 16], rbp
1056e68: 48 8d 6c 24 10 lea rbp, [rsp + 16]
1056e6d: e8 2e 36 fd ff call _runtime.printlock
1056e72: 48 8d 05 d3 bf 01 00 lea rax, [rip + 114643]
1056e79: 48 89 04 24 mov qword ptr [rsp], rax
1056e7d: 48 c7 44 24 08 06 00 00 00 mov qword ptr [rsp + 8], 6
1056e86: e8 55 3f fd ff call _runtime.printstring
1056e8b: e8 90 36 fd ff call _runtime.printunlock
1056e90: 48 8b 6c 24 10 mov rbp, qword ptr [rsp + 16]
1056e95: 48 83 c4 18 add rsp, 24
1056e99: c3 ret
1056e9a: e8 31 9d ff ff call _runtime.morestack_noctxt
1056e9f: eb af jmp _main.main
There are a few things going on here. The most relevant lines to us right now are these four:
1056e72: 48 8d 05 d3 bf 01 00 lea rax, [rip + 114643]
1056e79: 48 89 04 24 mov qword ptr [rsp], rax
1056e7d: 48 c7 44 24 08 06 00 00 00 mov qword ptr [rsp + 8], 6
1056e86: e8 55 3f fd ff call _runtime.printstring
Taking these line by line:
- lea rax, [rip + 114643] - the lea instruction is Load Effective Address, and rip is the instruction pointer register. Here lea is being used to calculate a memory address relative to the current instruction (or more accurately the next instruction, discussed further down). The result of that calculation is placed into the rax register.
- mov qword ptr [rsp], rax - moves the value held in the rax register into the memory location pointed to by the rsp register (stack pointer).
- mov qword ptr [rsp + 8], 6 - moves 6 into the memory location pointed to by the stack pointer + 8. This carries the length of the string.
- call _runtime.printstring - calls the print string procedure.
What do we need to understand from all of this?
- The calling convention used by this version of Go requires that arguments are passed on the stack (Go 1.17 and later use a register-based convention on amd64, so newer toolchains produce different code). This is unlike System V x86-64, which uses a specific set of registers for arguments, only leveraging the stack once those are exhausted. It is not covered in these 4 lines, but Go also returns values on the stack (again different to System V x86-64).
- The string value we're interested in is loaded from memory. The offset is known at compile time, so that would suggest the value is contained in a read-only data section baked into the binary (most likely the __rodata section - we'll check this in a moment).
- Go passes the string length to the function - so it is more than likely that Go strings are not NULL terminated.
- The pointer and length are passed together in adjacent stack memory. So these are either 2 arguments or 1 argument where the type is a struct containing both values.
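The pointer-plus-length observation can be checked from inside a running Go program. The following is a minimal sketch, assuming the two-word layout the disassembly suggests; the stringHeader type and the helper names are hypothetical mirrors written for illustration, not runtime APIs, and the code relies on unsafe and internal layout details.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stringHeader is a hypothetical mirror of the two adjacent words seen on
// the stack: a pointer to the bytes followed by the length.
type stringHeader struct {
	data   unsafe.Pointer
	length int
}

// headerOf reinterprets a string variable as a stringHeader.
// Illustration only: this depends on the runtime's internal layout.
func headerOf(s *string) stringHeader {
	return *(*stringHeader)(unsafe.Pointer(s))
}

// byteAt reads the i-th byte through the raw data pointer.
func byteAt(h stringHeader, i int) byte {
	return *(*byte)(unsafe.Pointer(uintptr(h.data) + uintptr(i)))
}

func main() {
	s := "banana"
	h := headerOf(&s)
	fmt.Println(h.length)                            // the length word
	fmt.Printf("%c%c\n", byteAt(h, 0), byteAt(h, 5)) // first and last byte
	fmt.Println(unsafe.Sizeof(s))                    // two machine words
}
```

No terminator is consulted anywhere: the length word alone determines where the string ends.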
As you can see, we have enough information in the above to work out where the string is stored in memory, and the length of the string.
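So, how do we obtain the string value without running the executable? objdump is displaying the address of each instruction on the left. By the time the lea executes, rip already holds the address of the next instruction, so we can conveniently add the offset to the address shown on the line that follows the lea.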
1056e72: 48 8d 05 d3 bf 01 00 lea rax, [rip + 114643]
1056e79: 48 89 04 24 mov qword ptr [rsp], rax
So from this we can calculate the address from where the string starts:
start_address = 0x1056e79 + 114643 = 0x1072e4c
And since we have the length, we can also work out where it ends. In our example, the length is 6. The start address points to the first character, so we only need to add 5 to cover the remaining characters:
start_address = 0x1056e79 + 114643 = 0x1072e4c
end_address = start_address + 5 = 0x1072e51
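The same arithmetic can be sketched in Go. This assumes the 7-byte lea form shown above, where the last four instruction bytes are a little-endian signed displacement; leaTarget is a hypothetical helper name, not part of any library.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// leaTarget computes the address produced by a RIP-relative lea.
// insnAddr is the address objdump prints for the instruction and raw holds
// its encoded bytes. Assumption: the final four bytes are the displacement,
// which holds for the lea encoding in this example.
func leaTarget(insnAddr uint64, raw []byte) uint64 {
	disp := int32(binary.LittleEndian.Uint32(raw[len(raw)-4:]))
	next := insnAddr + uint64(len(raw)) // rip points at the next instruction
	return uint64(int64(next) + int64(disp))
}

func main() {
	raw := []byte{0x48, 0x8d, 0x05, 0xd3, 0xbf, 0x01, 0x00}
	start := leaTarget(0x1056e72, raw)
	end := start + 6 - 1 // length 6, so the last byte is 5 further on
	fmt.Printf("start=%#x end=%#x\n", start, end) // start=0x1072e4c end=0x1072e51
}
```

Decoding the displacement from the raw bytes (d3 bf 01 00) gives 114643, matching the value objdump prints.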
I mentioned before that the string is likely stored in the __rodata section. Now is a good moment to verify that. We can use objdump to inspect this particular section:
➜ test $ objdump -s -j __rodata test
<snip>
1072e40 2c206e6f 74205343 48454420 62616e61 , not SCHED bana
1072e50 6e616566 656e6365 6f626a65 6374706f naefenceobjectpo
<snip>
Let's break this down:
address | byte | note |
---|---|---|
1072e40 | , | |
1072e41 | (space) | |
1072e42 | n | |
1072e43 | o | |
1072e44 | t | |
1072e45 | (space) | |
1072e46 | S | |
1072e47 | C | |
1072e48 | H | |
1072e49 | E | |
1072e4a | D | |
1072e4b | (space) | |
1072e4c | b | ← our start address |
1072e4d | a | |
1072e4e | n | |
1072e4f | a | |
1072e50 | n | |
1072e51 | a | ← our end address |
1072e52 | e | |
1072e53 | f | |
1072e54 | e | |
1072e55 | n | |
1072e56 | c | |
1072e57 | e | |
1072e58 | o | |
1072e59 | b | |
1072e5a | j | |
1072e5b | e | |
1072e5c | c | |
1072e5d | t | |
1072e5e | p | |
1072e5f | o | |
This all lines up nicely!
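The lookup in the table can be reproduced with a few lines of Go. This is a sketch that works on the two dump rows shown above rather than on the binary itself; sliceAt and mustHex are hypothetical helper names.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// sliceAt returns n bytes starting at virtual address addr, given section
// data whose first byte lives at address base.
func sliceAt(data []byte, base, addr, n uint64) ([]byte, error) {
	if addr < base || addr+n > base+uint64(len(data)) {
		return nil, fmt.Errorf("range %#x+%d is outside the data", addr, n)
	}
	off := addr - base
	return data[off : off+n], nil
}

// mustHex decodes the hex columns of objdump -s rows, ignoring spaces.
func mustHex(cols string) []byte {
	b, err := hex.DecodeString(strings.ReplaceAll(cols, " ", ""))
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	// The two rows from the dump, starting at 0x1072e40.
	data := mustHex("2c206e6f 74205343 48454420 62616e61 " +
		"6e616566 656e6365 6f626a65 6374706f")
	b, err := sliceAt(data, 0x1072e40, 0x1072e4c, 6)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", b) // "banana"
}
```

Working on a real binary would mean reading the section contents and its load address from the file instead of pasting rows, but the offset calculation stays the same.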
So with this we have everything we need to extract string constants from the binary. Note that this only covers the simple case where a single string argument is passed to a function. Where string constants are referenced in other settings, the sequence of instructions will be different (but they should always at least reference the string length and location). These won't be discussed here, as there isn't any fundamental difference in the way the location and length can be extracted.
Now, if we encounter a set of instructions that look like they relate to string constants, there is always the possibility that they don't. Thus far we only know the data resides in __rodata for our example, but nothing more. __rodata can be used for a number of things, so there is no guarantee we are dealing with a string constant. So, we might need another heuristic. Fortunately for us, Go adds a handy symbol that indicates which chunk of data relates to constant strings - this is named go.string.*. We can see the address where this starts by using the -t flag for objdump:
➜ test $ objdump -t test | rg 'go\.string\.\*'
0000000001072c38 l O __TEXT,__rodata _go.string.*
So this tells us the go.string.* data starts at 1072c38. Let's have a look at __rodata again:
➜ test $ objdump -s -j __rodata test
<snip>
1072c30 81060000 00000000 2028292b 2c2d2e2f ........ ()+,-./
1072c40 3a3c3d3f 5b0a095d 202b2040 2050205b :<=?[..] + @ P [
1072c50 2920290a 2c202d3e 3a203e20 220a0a20 ) )., ->: > "..
<snip>
The address indicated falls halfway along the first line (offset 8 in the row starting at 1072c30). So that looks about right - a run of ASCII characters starting at that particular address. So, with this, we have an additional heuristic: if an instruction references an address within that block, it's highly likely to be a string. If it is outside that block, we can ignore it.
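A rough sketch of that heuristic follows. It assumes a symbol's extent runs up to the address of the next symbol, which is a simplification rather than something the symbol table guarantees. Only the go.string.* address comes from the output above; the neighbouring symbols and their addresses are invented placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// symbol is a hypothetical, pared-down symbol table entry.
type symbol struct {
	name string
	addr uint64
}

// containing returns the symbol whose assumed range covers addr, treating
// each symbol as extending to the start of the next one. An address past
// the last symbol is reported as not found, since its extent is unknown.
func containing(syms []symbol, addr uint64) (string, bool) {
	sorted := append([]symbol(nil), syms...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].addr < sorted[j].addr })
	// i is the index of the first symbol that starts after addr.
	i := sort.Search(len(sorted), func(i int) bool { return sorted[i].addr > addr })
	if i == 0 || i == len(sorted) {
		return "", false
	}
	return sorted[i-1].name, true
}

func main() {
	syms := []symbol{
		{"placeholder.before", 0x1070000}, // invented for illustration
		{"_go.string.*", 0x1072c38},       // from objdump -t above
		{"placeholder.after", 0x1080000},  // invented for illustration
	}
	name, ok := containing(syms, 0x1072e4c) // the address of "banana"
	fmt.Println(name, ok)
}
```

With real data, the symbol list would come from the binary's symbol table, and the true end of the go.string.* block would need to be confirmed rather than inferred.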
Armed with all of this, we can do a pretty good job of extracting strings. Unfortunately, not all cases look like this.
Here is another program that does essentially the same thing as the first example (it prints banana), this time via fmt.Print:
package main
import "fmt"
const x = "banana"
func main() {
fmt.Print(x)
}
Build and disassemble:
➜ test $ go build
➜ test $ objdump -macho -disassemble -dis-symname='_main.main' -x86-asm-syntax=intel test
test:
(__TEXT,__text) section
_main.main:
109cfa0: 65 48 8b 0c 25 30 00 00 00 mov rcx, qword ptr gs:[48]
109cfa9: 48 3b 61 10 cmp rsp, qword ptr [rcx + 16]
109cfad: 76 70 jbe 0x109d01f
109cfaf: 48 83 ec 58 sub rsp, 88
109cfb3: 48 89 6c 24 50 mov qword ptr [rsp + 80], rbp
109cfb8: 48 8d 6c 24 50 lea rbp, [rsp + 80]
109cfbd: 0f 57 c0 xorps xmm0, xmm0
109cfc0: 0f 11 44 24 40 movups xmmword ptr [rsp + 64], xmm0
109cfc5: 48 8d 05 74 e2 00 00 lea rax, [rip + 57972]
109cfcc: 48 89 44 24 40 mov qword ptr [rsp + 64], rax
109cfd1: 48 8d 05 28 b8 04 00 lea rax, [rip + 309288]
109cfd8: 48 89 44 24 48 mov qword ptr [rsp + 72], rax
109cfdd: 48 8b 05 94 e0 0d 00 mov rax, qword ptr [rip + _os.Stdout]
109cfe4: 48 8d 0d 95 d0 04 00 lea rcx, [rip + "_go.itab.*os.File,io.Writer"]
109cfeb: 48 89 0c 24 mov qword ptr [rsp], rcx
109cfef: 48 89 44 24 08 mov qword ptr [rsp + 8], rax
109cff4: 48 8d 44 24 40 lea rax, [rsp + 64]
109cff9: 48 89 44 24 10 mov qword ptr [rsp + 16], rax
109cffe: 48 c7 44 24 18 01 00 00 00 mov qword ptr [rsp + 24], 1
109d007: 48 c7 44 24 20 01 00 00 00 mov qword ptr [rsp + 32], 1
109d010: e8 8b 99 ff ff call _fmt.Fprint
109d015: 48 8b 6c 24 50 mov rbp, qword ptr [rsp + 80]
109d01a: 48 83 c4 58 add rsp, 88
109d01e: c3 ret
109d01f: e8 8c c4 fb ff call _runtime.morestack_noctxt
109d024: e9 77 ff ff ff jmp _main.main
This is quite different. First let's identify the relevant section:
109cfc5: 48 8d 05 74 e2 00 00 lea rax, [rip + 57972]
109cfcc: 48 89 44 24 40 mov qword ptr [rsp + 64], rax
109cfd1: 48 8d 05 28 b8 04 00 lea rax, [rip + 309288]
109cfd8: 48 89 44 24 48 mov qword ptr [rsp + 72], rax
You'll notice these lines don't contain anything indicating a string length. It's not entirely clear what is going on, but the signature of fmt.Print may give us a clue:
func Print(a ...interface{}) (n int, err error)
So it receives empty interfaces as arguments. Given the way interfaces work, it's entirely possible what we're seeing here is something that indicates type along with a pointer to the underlying value. Let's check that first reference and run with the theory that it is carrying some type indication:
109cfc5: 48 8d 05 74 e2 00 00 lea rax, [rip + 57972]
109cfcc: 48 89 44 24 40 mov qword ptr [rsp + 64], rax
address = 0x109cfcc + 57972 = 0x10ab240
Looking this up we find:
<snip>
10ab240 10000000 00000000 08000000 00000000 ................
10ab250 b45cffe0 07080818 20460d01 00000000 .\...... F......
10ab260 a0630e01 00000000 64170000 e0b20000 .c......d.......
<snip>
A bit of sleuthing turns up https://golang.org/src/runtime/typekind.go as a potential source of type identifiers. Here kindString is 24 (0x18). In that block of data, there is only one 0x18, at address 0x10ab257 (+23 from the original address). Perhaps that is holding the type kind. Let's test our theory with a different type:
package main
import (
"fmt"
)
const x uint16 = 60000
func main() {
fmt.Print(x)
}
➜ test $ go build
➜ test $ objdump -macho -disassemble -dis-symname='_main.main' -x86-asm-syntax=intel test
test:
(__TEXT,__text) section
_main.main:
109cfa0: 65 48 8b 0c 25 30 00 00 00 mov rcx, qword ptr gs:[48]
109cfa9: 48 3b 61 10 cmp rsp, qword ptr [rcx + 16]
109cfad: 76 70 jbe 0x109d01f
109cfaf: 48 83 ec 58 sub rsp, 88
109cfb3: 48 89 6c 24 50 mov qword ptr [rsp + 80], rbp
109cfb8: 48 8d 6c 24 50 lea rbp, [rsp + 80]
109cfbd: 0f 57 c0 xorps xmm0, xmm0
109cfc0: 0f 11 44 24 40 movups xmmword ptr [rsp + 64], xmm0
109cfc5: 48 8d 05 f4 e2 00 00 lea rax, [rip + 58100]
109cfcc: 48 89 44 24 40 mov qword ptr [rsp + 64], rax
109cfd1: 48 8d 05 ce b2 04 00 lea rax, [rip + 307918]
109cfd8: 48 89 44 24 48 mov qword ptr [rsp + 72], rax
109cfdd: 48 8b 05 94 e0 0d 00 mov rax, qword ptr [rip + _os.Stdout]
109cfe4: 48 8d 0d 75 d0 04 00 lea rcx, [rip + "_go.itab.*os.File,io.Writer"]
109cfeb: 48 89 0c 24 mov qword ptr [rsp], rcx
109cfef: 48 89 44 24 08 mov qword ptr [rsp + 8], rax
109cff4: 48 8d 44 24 40 lea rax, [rsp + 64]
109cff9: 48 89 44 24 10 mov qword ptr [rsp + 16], rax
109cffe: 48 c7 44 24 18 01 00 00 00 mov qword ptr [rsp + 24], 1
109d007: 48 c7 44 24 20 01 00 00 00 mov qword ptr [rsp + 32], 1
109d010: e8 8b 99 ff ff call _fmt.Fprint
109d015: 48 8b 6c 24 50 mov rbp, qword ptr [rsp + 80]
109d01a: 48 83 c4 58 add rsp, 88
109d01e: c3 ret
109d01f: e8 8c c4 fb ff call _runtime.morestack_noctxt
109d024: e9 77 ff ff ff jmp _main.main
address = 0x109cfcc + 58100 = 0x10ab2c0
10ab2c0 02000000 00000000 00000000 00000000 ................
10ab2d0 a00ef2ef 0f020209 58440d01 00000000 ........XD......
10ab2e0 98630e01 00000000 6e170000 e0b50000 .c......n.......
So let's add 23 to our address:
0x10ab2c0 + 23 = 0x10ab2d7
This address holds a value of 0x09. kindUint16 is also 0x09!
A bit more sleuthing suggests this block of memory may well be runtime._type or some variation of it. Counting the bytes for each field:
type _type struct {
size uintptr // +8 bytes
ptrdata uintptr // +8 bytes
hash uint32 // +4 bytes
tflag tflag // +1 byte
align uint8 // +1 byte
fieldAlign uint8 // +1 byte
kind uint8
// <snip>
}
That makes 23 bytes to reach kind.
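The byte counting can be double-checked mechanically. The sketch below assumes a 64-bit target and mirrors only the leading fields listed above; typeHeader is a hypothetical stand-in rather than the runtime's own type, and the kindMask value follows runtime/typekind.go. It also decodes the first 8 bytes as size, which should come out as 16 for a string and 2 for a uint16 if the theory holds.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"strings"
	"unsafe"
)

// typeHeader is a hypothetical mirror of the leading runtime._type fields.
type typeHeader struct {
	size       uintptr
	ptrdata    uintptr
	hash       uint32
	tflag      uint8
	align      uint8
	fieldAlign uint8
	kind       uint8
}

// kindOffset is where kind sits inside the header (23 on 64-bit targets).
var kindOffset = unsafe.Offsetof(typeHeader{}.kind)

// kindMask strips the flag bits stored alongside the kind.
const kindMask = (1 << 5) - 1

// kindAt extracts the kind from a raw dump of a type descriptor.
func kindAt(dump []byte) uint8 { return dump[kindOffset] & kindMask }

// sizeAt decodes the leading little-endian size field.
func sizeAt(dump []byte) uint64 { return binary.LittleEndian.Uint64(dump[:8]) }

// mustHex decodes the hex columns of objdump -s rows, ignoring spaces.
func mustHex(cols string) []byte {
	b, err := hex.DecodeString(strings.ReplaceAll(cols, " ", ""))
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	// First two rows of each dump shown above.
	str := mustHex("10000000 00000000 08000000 00000000 b45cffe0 07080818 20460d01 00000000")
	u16 := mustHex("02000000 00000000 00000000 00000000 a00ef2ef 0f020209 58440d01 00000000")
	fmt.Println(kindOffset)                // 23 on a 64-bit target
	fmt.Println(kindAt(str), sizeAt(str)) // 24 16
	fmt.Println(kindAt(u16), sizeAt(u16)) // 9 2
}
```

The size values agree with a two-word string and a two-byte uint16, which adds some weight to the runtime._type interpretation.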
So we have a likely way to obtain the type. So what about the second reference? Could this be the string value? Let's check:
109cfd1: 48 8d 05 28 b8 04 00 lea rax, [rip + 309288]
109cfd8: 48 89 44 24 48 mov qword ptr [rsp + 72], rax
address = 0x109cfd8 + 309288 = 0x10e8800
10e8800 5ace0c01 00000000 06000000 00000000 Z...............
No banana! Well, actually that makes sense. As discussed in the first example, Go always wants to deal with strings by carrying a pointer and length. So perhaps that's what we have here? The most obvious thing to look for is the length of our string, which is 6. We can see that in the latter 8 bytes. This is an x86-64 Mach-O binary, so values are stored little endian, and we need to do a bit of re-arranging to read those 8 bytes as a big endian number:
little endian: 0600000000000000
big endian: 0000000000000006
OK, so a value of 6! That matches nicely.
So the other 8 bytes may well be a pointer, so let's check that out. Switching to big endian again:
little endian: 5ace0c0100000000
big endian: 00000000010cce5a
So a potential address of 0x10cce5a. Looking at the data around that address:
10cce50 6e63686f 5b5d6279 74656261 6e616e61 ncho[]bytebanana
10cce60 6368616e 3c2d6566 656e6365 6572726e chan<-efenceerrn
banana located! With all of this we have another means to locate strings. Unfortunately the block of data holding the string pointer and length does not have any obvious symbols assigned. (I believe this area is referred to as statictmp or stmp - some earlier versions of Go did leave symbols here, newer versions do not. There may be some hope for them making a return.)
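The decoding steps from this second example can be sketched as follows. It works on the bytes shown in the dumps above rather than on the binary, and decodeStringHeader is a hypothetical helper name.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeStringHeader splits a 16-byte blob into a little-endian pointer
// followed by a little-endian length.
func decodeStringHeader(b []byte) (ptr, length uint64, err error) {
	if len(b) < 16 {
		return 0, 0, fmt.Errorf("need 16 bytes, got %d", len(b))
	}
	return binary.LittleEndian.Uint64(b[0:8]), binary.LittleEndian.Uint64(b[8:16]), nil
}

func main() {
	// The 16 bytes found at 0x10e8800.
	header := []byte{
		0x5a, 0xce, 0x0c, 0x01, 0x00, 0x00, 0x00, 0x00,
		0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	}
	ptr, n, err := decodeStringHeader(header)
	if err != nil {
		panic(err)
	}

	// The dump row that starts at 0x10cce50.
	const base = 0x10cce50
	row := []byte("ncho[]bytebanana")
	off := ptr - base
	fmt.Printf("ptr=%#x len=%d value=%q\n", ptr, n, row[off:off+n])
}
```

Using encoding/binary avoids the manual byte reversal: the little-endian decoder returns the pointer and length as ordinary integers.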