WIP: S2++ #846

Closed
wants to merge 13 commits into from
52 changes: 52 additions & 0 deletions s2/README.md
@@ -1022,6 +1022,7 @@ See [using indexes](https://github.com/klauspost/compress/tree/master/s2#using-i
* Frame [Stream identifier](https://github.com/google/snappy/blob/master/framing_format.txt#L68) changed from `sNaPpY` to `S2sTwO`.
* [Framed compressed blocks](https://github.com/google/snappy/blob/master/format_description.txt) can be up to 4MB (up from 64KB).
* Compressed blocks can have an offset of `0`, which indicates to repeat the last seen offset.
* If the first bytes of a block are `0x80, 0x00, 0x00` (copy, 2-byte offset = 0), all [Copy with 4-byte offset (11)](https://github.com/google/snappy/blob/main/format_description.txt#L106) tags use 3-byte offsets instead for the remainder of the block.

Repeat offsets must be encoded as a [2.2.1. Copy with 1-byte offset (01)](https://github.com/google/snappy/blob/master/format_description.txt#L89), where the offset is 0.

@@ -1047,6 +1048,57 @@ The first copy of a block cannot be a repeat offset and the offset is reset on e

Default streaming block size is 1MB.

## S2++ Mode

If the first bytes of a block are `0x80, 0x00, 0x00` (copy, 2-byte offset = 0),
the encoding of all [Copy with 2-byte offset (10)](https://github.com/google/snappy/blob/main/format_description.txt#L98)
and [Copy with 4-byte offset (11)](https://github.com/google/snappy/blob/main/format_description.txt#L106) tags changes for the remainder of the block.

No literals may precede this tag and, as specified above, no repeat may precede the first match.
The mode is only triggered by this exact byte sequence.
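
As a minimal sketch of the check described above, the following Go snippet tests whether an operation stream starts with the marker bytes. The name `hasS2ppMarker` is hypothetical and not part of the package, and the sketch assumes the caller has already consumed any length header that precedes the operations.

```go
package main

import (
	"bytes"
	"fmt"
)

// s2ppMarker is the byte sequence the text above gives as the S2++ trigger.
var s2ppMarker = []byte{0x80, 0x00, 0x00}

// hasS2ppMarker reports whether ops begins with the marker.
// Assumption: ops is the block's operation stream, with any length
// header already consumed by the caller. Hypothetical helper.
func hasS2ppMarker(ops []byte) bool {
	return bytes.HasPrefix(ops, s2ppMarker)
}

func main() {
	// Marker at the very start of the stream.
	fmt.Println(hasS2ppMarker([]byte{0x80, 0x00, 0x00, 0x2a}))
	// Marker bytes present, but not at the start.
	fmt.Println(hasS2ppMarker([]byte{0x00, 0x80, 0x00, 0x00}))
}
```

Because the marker must be the first operation, a plain prefix comparison is enough; the same bytes appearing later in the stream do not switch modes.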

## Tag 0x2 (TagCopy2)

The length field now has a base value of 4, and there are 3 special values for longer matches.

| Bits | Meaning | Description |
|------|---------|------------------------------------------------------------------------|
| 0-1 | Tag | Always 0x2 |
| 2-7 | Length | Length of copy or repeat<br/>Values are 0-63. See decoding table below |

| Value | Output |
|-------|---------------------|
| 0-60 | Base + Value |
| 61 | Base + Read 1 byte |
| 62 | Base + Read 2 bytes |
| 63 | Base + Read 3 bytes |

Base value is 4 for all copies.

Offsets are encoded as 2 bytes following the length.
The maximum backreference offset is therefore 65535.
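
The TagCopy2 layout above can be sketched as a small decoder. This is an illustration under stated assumptions, not the package's implementation: `decodeCopy2` is a hypothetical name, the extra length bytes and the offset are assumed to be little-endian (as in Snappy), the extra length bytes are assumed to come before the offset, and the value read is added to the base with no further bias, exactly as the table is written.

```go
package main

import (
	"errors"
	"fmt"
)

const copy2Base = 4 // base length for all TagCopy2 copies

// decodeCopy2 decodes one TagCopy2 operation at the start of src and
// returns the copy length, the offset and the number of bytes consumed.
// Assumptions (not stated in the text): little-endian byte order and
// extra length bytes placed before the 2 offset bytes.
func decodeCopy2(src []byte) (length, offset, n int, err error) {
	if len(src) == 0 || src[0]&3 != 0x2 {
		return 0, 0, 0, errors.New("not a TagCopy2 operation")
	}
	v := int(src[0] >> 2) // bits 2-7: value 0-63
	extra := 0
	if v > 60 {
		extra = v - 60 // 61, 62, 63 -> read 1, 2, 3 bytes
	}
	if len(src) < 1+extra+2 {
		return 0, 0, 0, errors.New("truncated TagCopy2 operation")
	}
	n = 1
	if extra > 0 {
		v = 0
		for i := 0; i < extra; i++ {
			v |= int(src[n+i]) << (8 * i)
		}
		n += extra
	}
	length = copy2Base + v
	offset = int(src[n]) | int(src[n+1])<<8
	n += 2
	return length, offset, n, nil
}

func main() {
	// Value 10 stored in the tag, offset 0x1234.
	fmt.Println(decodeCopy2([]byte{0x2a, 0x34, 0x12}))
	// Value 61: one extra length byte (100), offset 1.
	fmt.Println(decodeCopy2([]byte{0xf6, 0x64, 0x01, 0x00}))
}
```

The size functions in `encode_best.go` suggest the encoder may apply an additional bias to extended lengths; the table does not state one, so the sketch does not apply any.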

## Tag 0x3 (TagCopy4)

| Bits | Meaning | Description |
|------|---------|------------------------------------------------------------------------|
| 0-1 | Tag | Always 0x3 |
| 2 | Repeat | 0 if copy, 1 if repeat. |
| 3-7 | Length | Length of copy or repeat<br/>Values are 0-31. See decoding table below |

| Value | Output |
|-------|---------------------|
| 0-28 | Base + Value |
| 29 | Base + Read 1 byte |
| 30 | Base + Read 2 bytes |
| 31 | Base + Read 3 bytes |

For copy operations, the base value is `4`. For repeats, the base value is `1`.

Copy offsets are encoded as `3` bytes following the length. The maximum backreference offset is therefore 16777215.

The S2 repeat encoding specified on TagCopy2 is not valid in this mode.
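
A similar sketch for TagCopy4, again only illustrative: `decodeCopy4` and the `copy4` type are hypothetical names, byte order is assumed to be little-endian, extra length bytes are assumed to precede the offset, and repeats are assumed to carry no offset bytes, since the text only specifies offsets for copies.

```go
package main

import (
	"errors"
	"fmt"
)

// copy4 is a decoded TagCopy4 operation (hypothetical type).
type copy4 struct {
	Repeat bool
	Length int
	Offset int // 0 for repeats, which reuse the previous offset
}

// decodeCopy4 decodes one TagCopy4 operation at the start of src and
// returns it together with the number of bytes consumed.
// Assumptions: little-endian values; repeats have no offset bytes.
func decodeCopy4(src []byte) (op copy4, n int, err error) {
	if len(src) == 0 || src[0]&3 != 0x3 {
		return copy4{}, 0, errors.New("not a TagCopy4 operation")
	}
	op.Repeat = src[0]&4 != 0 // bit 2: 1 if repeat
	v := int(src[0] >> 3)     // bits 3-7: value 0-31
	extra := 0
	if v > 28 {
		extra = v - 28 // 29, 30, 31 -> read 1, 2, 3 bytes
	}
	base, offBytes := 4, 3
	if op.Repeat {
		base, offBytes = 1, 0
	}
	if len(src) < 1+extra+offBytes {
		return copy4{}, 0, errors.New("truncated TagCopy4 operation")
	}
	n = 1
	if extra > 0 {
		v = 0
		for i := 0; i < extra; i++ {
			v |= int(src[n+i]) << (8 * i)
		}
		n += extra
	}
	op.Length = base + v
	if !op.Repeat {
		op.Offset = int(src[n]) | int(src[n+1])<<8 | int(src[n+2])<<16
		n += 3
	}
	return op, n, nil
}

func main() {
	// Copy: value 5 in the tag, 3-byte offset 0x010001.
	fmt.Println(decodeCopy4([]byte{0x2b, 0x01, 0x00, 0x01}))
	// Repeat: value 29, so one extra length byte (16) follows.
	fmt.Println(decodeCopy4([]byte{0xef, 0x10}))
}
```

Splitting the base by the repeat bit lets a repeat encode lengths as short as 1, while copies keep the minimum match length of 4.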

# Dictionary Encoding

Adding dictionaries allows providing a custom dictionary that serves as a lookup at the beginning of blocks.
97 changes: 60 additions & 37 deletions s2/encode_best.go
@@ -378,7 +378,7 @@ func encodeBlockBest(dst, src []byte, dict *Dict) (d int) {
offset := s - best.offset
s += best.length

if offset > 65535 && s-base <= 5 && !best.rep {
if offset > 65535 && s-base <= 4 && !best.rep {
// Bail if the match is equal or worse to the encoding.
s = best.s + 1
if s >= sLimit {
@@ -716,35 +716,26 @@ emitRemainder:
// 4 <= length && length <= 1 << 24
func emitCopySize(offset, length int) int {
if offset >= 65536 {
i := 0
if length > 64 {
length -= 64
if length >= 4 {
// Emit remaining as repeats
return 5 + emitRepeatSize(offset, length)
}
i = 5
}
if length == 0 {
return i
// 3 Byte offset + Variable length (base length 4).
length -= 3
if length > 28 {
length -= 28
}
return i + 5
return 3 + emitRepeatSize(offset, length)
}

// Offset no more than 2 bytes.
if length > 64 {
if offset < 2048 {
// Emit 8 bytes, then rest as repeats...
return 2 + emitRepeatSize(offset, length-8)
if offset < 1024 {
if length < 11+8 {
// Emit up to 18 bytes with short offset.
return 2
}
if length < 18+256 {
return 3
}
// Emit remaining as repeats, at least 4 bytes remain.
return 3 + emitRepeatSize(offset, length-60)
}
if length >= 12 || offset >= 2048 {
return 3
}
// Emit the remaining copy, encoded as 2 bytes.
return 2
// 2 byte offset + Variable length (base length 4).
return emitCopy2Size(length)
}

// emitCopyNoRepeatSize returns the size to encode the offset+length
@@ -755,7 +746,7 @@ func emitCopySize(offset, length int) int {
// 4 <= length && length <= 1 << 24
func emitCopyNoRepeatSize(offset, length int) int {
if offset >= 65536 {
return 5 + 5*(length/64)
return 4 + 4*(length/64)
}

// Offset no more than 2 bytes.
@@ -771,26 +762,58 @@ func emitCopyNoRepeatSize(offset, length int) int {
}

// emitRepeatSize returns the number of bytes required to encode a repeat.
// Length must be at least 4 and < 1<<24
// Length must be at least 1 and < 1<<24
func emitRepeatSize(offset, length int) int {
// Repeat offset, make length cheaper
if length <= 4+4 || (length < 8+4 && offset < 2048) {
if length <= 0 {
return 0
}

if length <= 29 {
return 1
}
length -= 29
if length <= 256 {
return 2
}
if length < (1<<8)+4+4 {
if length <= 65536 {
return 3
}
if length < (1<<16)+(1<<8)+4 {
return 4
}
const maxRepeat = (1 << 24) - 1
length -= (1 << 16) - 4
left := 0
if length > maxRepeat {
left = length - maxRepeat + 4
left = length - maxRepeat
}
if left > 0 {
return 5 + emitRepeatSize(offset, left)
return 4 + emitRepeatSize(offset, left)
}

// emitCopy2Size returns the number of bytes required to encode a copy2.
// Length must be less than 1<<24
func emitCopy2Size(length int) int {
length -= 4
if length < 0 {
// Should not happen, but we keep it so caller doesn't have to check.
return 2
}

if length <= 60 {
// Length inside tag.
return 1 + 2
}
length -= 60
if length <= 256 {
// Length in 1 byte.
return 2 + 2
}
if length <= 65536 {
// Length in 2 bytes.
return 3 + 2
}
// Length in 3 bytes.
// Anything remaining must be repeats.
const maxRepeat = (1 << 24) - 1
left := 0
if length > maxRepeat {
left = length - maxRepeat
}
return 5
return 2 + 4 + emitRepeatSize(0, left)
}
2 changes: 1 addition & 1 deletion s2/encode_better.go
@@ -231,7 +231,7 @@ func encodeBlockBetterGo(dst, src []byte) (d int) {
candidateL += 8
}

if offset > 65535 && s-base <= 5 && repeat != offset {
if offset > 65535 && s-base <= 4 && repeat != offset {
// Bail if the match is equal or worse to the encoding.
s = nextS + 1
if s >= sLimit {