DRIVERS-2926 BSON Binary Vector Subtype Support #1658

Merged
31 commits
ec64aa9
Added bson_corpus test new binary subtype 9: vectors
caseyclements Sep 13, 2024
d5ab5f1
Added first draft of Binary Vector subtype spec markdown
caseyclements Sep 16, 2024
91212ca
Move bson-binary-vector.md from bson-corpus to its own dir
caseyclements Sep 17, 2024
8757836
Updates based on feedback.
caseyclements Sep 17, 2024
830632a
Added README.md for Binary Vector tests
caseyclements Sep 20, 2024
67b410d
Added tests for binary vector subtype
caseyclements Sep 20, 2024
07667a1
Broke tests into 3 files by dtype
caseyclements Sep 20, 2024
80f19fa
Added github link in Reference Implementation
caseyclements Sep 20, 2024
7255b6c
PyArrow -> Arrow
caseyclements Sep 20, 2024
0ff289b
Added reference to jira ticket
caseyclements Sep 21, 2024
b3d6ea0
Added example for Binary structure
caseyclements Sep 23, 2024
8cfc15a
Added table visualization of binary structure
caseyclements Sep 23, 2024
a6ee71b
typo
caseyclements Sep 23, 2024
0d10725
Updates from Anna's comments
caseyclements Sep 26, 2024
5935ce0
Correction from Shane's comment
caseyclements Sep 26, 2024
a8b464e
Fix typo in binary structure html table
caseyclements Sep 26, 2024
f50677b
Moved editorial comments about PACKED_BIT ambiguity to an FAQ
caseyclements Sep 27, 2024
2d4ea72
Added Required Tests section to README. Removed JSON from tests.
caseyclements Sep 27, 2024
f50b1cc
Further improvements to PACKED_BIT with padding example.
caseyclements Sep 27, 2024
d6f160b
Fixed consistency for subtype reference. Follows Tech Design Doc
caseyclements Sep 28, 2024
60088d9
Addressed Neal's comments.
caseyclements Oct 7, 2024
a1b87f7
Made clear that it is the least significant bit that is ignored.
caseyclements Oct 7, 2024
d267b2a
Additional invalid test cases for PACKED_BIT vectors
caseyclements Oct 7, 2024
2cd0b4a
Merge branch 'master' into DRIVERS-2926-BSON-Binary-Vectors
caseyclements Oct 7, 2024
0b888fb
Additional float32 binary vector test cases
caseyclements Oct 7, 2024
c823174
Added API Guidance section
caseyclements Oct 8, 2024
fcc1be5
Change mention of binary subtype from x09 to 9
caseyclements Oct 16, 2024
d00541a
Remove github link to pymongo.binary. Reference implementation now si…
caseyclements Oct 16, 2024
b30ed35
Clarification of test requirements for drivers that natively support …
caseyclements Oct 22, 2024
0ccc399
Adds Validation subsection
caseyclements Oct 22, 2024
ae32422
Add note about signature of from_vector, that it be implemented as fr…
caseyclements Oct 22, 2024
103 changes: 103 additions & 0 deletions source/bson-binary-vector/bson-binary-vector.md
@@ -0,0 +1,103 @@
# BSON Binary Subtype 9 - Vector

- Status: Pending
- Minimum Server Version: N/A

______________________________________________________________________

## Abstract

This document describes the addition of a new subtype to the Binary BSON type. This subtype is used for efficient
storage and retrieval of vectors. Vectors here refer to densely packed arrays of numbers, all of the same type.

## Motivation

These representations correspond to the numeric types supported by popular numerical libraries for vector processing,
such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed
format used by these libraries can result in significant memory savings and processing efficiency.
Member

"result in up to significant" -> "result in significant"


### META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

## Specification

This specification introduces a new BSON binary subtype, the vector, with value `"\x09"`.
Member

"\x09" -> 9.

Contributor Author

This is a can of worms. I'm going to leave it as it is.

Contributor Author

I take this back. I will use simple integers when describing subtype.

Member

This still needs updating

Contributor Author

Done


Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification.
Member

This sentence seems to be hiding a lot of complexity. What should the driver APIs look like? What happens if the "padding" field is non-zero but the dtype is a multiple of 8? How does the padding field change the output? Are we planning to add a NONPACKED_BIT type which represents the data as uint1 (or bool), e.g. the user would actually give [1, 0, 0, 1] for a 4-bit vector?

Contributor Author

It is my understanding that it is up to the driver to implement the API as they please. Is that not true?

If the padding is non-zero for a dtype where it does not make sense, then the tests will fail.


#### Data Types

Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented.

| Vector data type | Alias | Bits per vector element | [Arrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) |
| ---------------- | ---------- | ----------------------- | ----------------------------------------------------------------------------------------- |
| `0x03` | INT8 | 8 | INT8 |
| `0x27` | FLOAT32 | 32 | FLOAT |
| `0x10` | PACKED_BIT | 1 `*` | BOOL |
Member

The previous documents I saw had other data types too, why is this spec limited to these 3?

Contributor Author

This was the defined work. The other document is not up-to-date or correct.

Contributor Author

Yesterday I gained some clarity on the initial requirements. They were for implementation in Python and MongoT. This specification is under the same time pressure, so perhaps we discuss the full design and whether this should be expanded to include those that are not yet implemented.


`*` A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of
integers in \[0, 255\]. So, for example, the vector `[0, 255]` would be shorthand for the 16 bit vector
`[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]`. The idea is that each number (a uint8) can be stored as a single byte. Of course,
some languages, Python for one, do not have a uint8 type, so each value must be represented as an int in memory, but not on disk.
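
The shorthand described above can be illustrated with a minimal sketch that expands a list of uint8 values into individual bits. The function name `unpack_bits` is hypothetical, and the most-significant-bit-first ordering within each byte is an assumption; the `[0, 255]` example does not depend on it.

```python
def unpack_bits(packed):
    """Expand a list of uint8 values into a flat list of 0/1 bits.

    Assumption: bits are read from most significant to least significant
    within each byte.
    """
    bits = []
    for byte in packed:
        if not 0 <= byte <= 255:
            raise ValueError(f"{byte} does not fit in a uint8")
        # Shift each bit position down to bit 0 and mask it off.
        bits.extend((byte >> shift) & 1 for shift in range(7, -1, -1))
    return bits


print(unpack_bits([0, 255]))
```

Two input numbers produce sixteen bits, which is the sense in which `[0, 255]` stands for a 16 bit vector.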

The authors are well-aware of the inherent ambiguity here, and alternatives. This is a market-standard, unfortunately.
Change is inevitable.
Member

This paragraph is out of place. We should move this to a "rationale" or "Q&A" section like we do for other specs.

Contributor Author

This whole bit was placed here to make the * more easy to understand. I would prefer it to stay as-is. Going into detail about the brand new field of Quantization in LLMs is not something I want to do.

Contributor

I agree with Shane here. We generally avoid editorializing like this in our RFC-like specifications.

Contributor Author

Ok.

Contributor Author

  • f50677b - Moved editorial comments about PACKED_BIT ambiguity to an FAQ


#### Byte padding

As not all data types have a bit length equal to a multiple of 8, and hence do not fit squarely into a whole number of
bytes, a second piece of metadata, the "padding", is included. This tells the driver how many bits in the
final byte are to be ignored.
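
As a rough illustration of how padding relates to the number of one-bit elements, the following sketch computes the byte count and padding for a PACKED_BIT vector of a given bit length. The helper name `packed_size` is hypothetical and not part of the specification.

```python
def packed_size(num_bits):
    """Return (byte_count, padding) for a vector of num_bits one-bit elements."""
    if num_bits < 0:
        raise ValueError("num_bits must be non-negative")
    # Round up to the next whole byte.
    byte_count = (num_bits + 7) // 8
    # Padding is the number of unused bits in the final byte (0 through 7).
    padding = byte_count * 8 - num_bits
    return byte_count, padding


print(packed_size(13))
print(packed_size(16))
```

A 13-bit vector needs two bytes with three bits to ignore, while a 16-bit vector fills two bytes exactly.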

#### Binary structure

Following the binary subtype `0x09`, a two-element byte array of metadata precedes the packed numbers.
Member

0x09 -> 9. We usually just refer to the subtype as a regular number. For example in https://github.com/mongodb/specifications/blob/master/source/bson-binary-encrypted/binary-encrypted.md

Contributor Author

I admit that there is inconsistency. It is coming from the Tech Spec document. For our purposes, the drivers team, I am happy to make it consistent. When we refer to "subtype" we use integers. When we refer to "dtype", we use hex. Cool? @ShaneHarvey


- The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may
grow.

- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value.

- The remainder contains the actual vector elements packed according to dtype.

For example, a vector `[6, 7]` of dtype PACKED_BIT (`\x10`) with a padding of `3` would look like this:
`b"\x10\x03\x06\x07"`: 1 byte for dtype, 1 for padding, and 1 for each uint8.

<table border="1" cellspacing="0" cellpadding="5">
<tr>
<td colspan="8">1st byte: dtype (from list in previous table) </td>
<td colspan="8">2nd byte: padding (values in [0,7])</td>
<td colspan="1">binary numbers packed according to dtype</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>...</td>
</tr>
</table>

All values use the little-endian format.
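
The layout described in this section (dtype byte, padding byte, then little-endian packed elements) could be sketched with Python's standard `struct` module as follows. The names `encode_vector` and `decode_vector` are hypothetical and are not the reference implementation's API; the sketch performs no validation.

```python
import struct

INT8, FLOAT32, PACKED_BIT = 0x03, 0x27, 0x10

# struct format character for one element of each dtype.
_FORMATS = {INT8: "b", FLOAT32: "f", PACKED_BIT: "B"}


def encode_vector(vector, dtype, padding=0):
    """Return the payload bytes: dtype, padding, then packed elements."""
    header = struct.pack("<BB", dtype, padding)
    body = struct.pack(f"<{len(vector)}{_FORMATS[dtype]}", *vector)
    return header + body


def decode_vector(data):
    """Return (vector, dtype, padding) from payload bytes."""
    dtype, padding = struct.unpack_from("<BB", data)
    code = _FORMATS[dtype]
    # Number of elements = remaining bytes / bytes per element.
    count = (len(data) - 2) // struct.calcsize("<" + code)
    vector = list(struct.unpack_from(f"<{count}{code}", data, 2))
    return vector, dtype, padding


print(encode_vector([6, 7], PACKED_BIT, padding=3))
```

For the `[6, 7]` PACKED_BIT example with a padding of 3, this produces the four bytes `10 03 06 07`.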

### Reference Implementation

- PYTHON (PYTHON-4577) [pymongo.binary](https://github.com/mongodb/mongo-python-driver/blob/master/bson/binary.py)
Member

This file path might change, let's just use the full JIRA link to the ticket.

Contributor Author

Done.


### Test Plan

See the [README](tests/README.md) for tests.
40 changes: 40 additions & 0 deletions source/bson-binary-vector/tests/README.md
@@ -0,0 +1,40 @@
# Testing Binary subtype 9: Vector

The JSON files in this directory tree are platform-independent tests that drivers can use to prove their conformance to
the specification.

These tests focus on the round trip of the list of numbers as input/output, along with their data type and byte padding.

Additional tests exist in `bson_corpus/tests/binary.json` but do not sufficiently test the end-to-end process of Vector
to BSON. For this reason, drivers must create a bespoke test runner for the vector subtype.

Each test case here pertains to a single vector. The inputs required to create the Binary BSON object are defined, and
when valid, the Canonical BSON and Extended JSON representations are included for comparison.

## Version

Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version.

## Format

#### Top level keys

Each JSON file contains three top-level keys.

- `description`: human-readable description of what is in the file
- `test_key`: Field name used when decoding/encoding a BSON document containing the single BSON Binary for the test
case. Applies to *every* case.
- `tests`: array of test case objects, each of which has the keys described below. Valid cases also contain additional
binary and JSON encoding values.

#### Keys of tests objects
Contributor

This file should describe all the assertions that the test runner should make. Many are obvious, but one that might be missed, for example, is that you can round trip from vector to bson and then back to vector.

IIRC the bson_corpus README does this pretty well.

Contributor Author

I'm not sure if this is outdated now. If you need more, would you please be specific? I am also happy to revisit once another driver takes a stab at it.

Contributor

We discussed this in person, so I think we're in alignment now. The idea is to just look at all the assertions being done in the Python driver test runner, and write them down in prose here.

Contributor Author

2d4ea72 - Added Required Tests section to README. Removed JSON from tests.


- `description`: string describing the test.
- `valid`: boolean indicating if the vector, dtype, and padding should be considered a valid input.
- `vector`: list of numbers
- `dtype_hex`: string defining the data type in hex (e.g. "0x10", "0x27")
- `dtype_alias`: (optional) string defining the dtype, perhaps as an Enum.
- `padding`: (optional) integer for byte padding. Defaults to 0.
- `canonical_bson`: (required if valid is true) an (uppercase) big-endian hex representation of a BSON byte string.
- `canonical_extjson`: (required if valid is true) string containing a Canonical Extended JSON document. Because this is
itself embedded as a *string* inside a JSON document, characters like quote and backslash are escaped.
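
To show how these keys might be consumed, here is a minimal sketch of a check for a single valid case, using the "Simple Vector INT8" values from `int8.json`. It builds the single-field BSON document by hand rather than through a driver; `vector_document` and `check_int8_case` are hypothetical helpers, and a real test runner would use the driver's own encoder and a proper Extended JSON comparison.

```python
import base64
import struct


def vector_document(test_key, payload):
    """Wrap a subtype 9 payload in a single-field BSON document."""
    element = (
        b"\x05"                                # BSON element type: binary
        + test_key.encode("utf-8") + b"\x00"   # field name as a cstring
        + struct.pack("<i", len(payload))      # int32 length of the payload
        + b"\x09"                              # binary subtype 9
        + payload
    )
    # Document = int32 total length + elements + trailing NUL byte.
    return struct.pack("<i", 4 + len(element) + 1) + element + b"\x00"


def check_int8_case(test_key, case):
    """Encode an INT8 case and compare it with the expected encodings."""
    dtype = int(case["dtype_hex"], 16)
    padding = case.get("padding", 0)
    vector = case["vector"]
    payload = bytes([dtype, padding]) + struct.pack(f"<{len(vector)}b", *vector)
    bson_ok = vector_document(test_key, payload).hex().upper() == case["canonical_bson"]
    # Simplification: only check that the base64 payload appears in the JSON string.
    b64_ok = base64.b64encode(payload).decode("ascii") in case["canonical_extjson"]
    return bson_ok and b64_ok


case = {
    "vector": [127, 7],
    "dtype_hex": "0x03",
    "padding": 0,
    "canonical_bson": "1600000005766563746F7200040000000903007F0700",
    "canonical_extjson": '{"vector": {"$binary": {"base64": "AwB/Bw==", "subType": "09"}}}',
}
print(check_int8_case("vector", case))
```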
45 changes: 45 additions & 0 deletions source/bson-binary-vector/tests/float32.json
@@ -0,0 +1,45 @@
{
"description": "Tests of Binary subtype 9, Vectors, with dtype FLOAT32",
"test_key": "vector",
"tests": [
{
"description": "Simple Vector FLOAT32",
"valid": true,
"vector": [127.0, 7.0],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAP5CAADgQA==\", \"subType\": \"09\"}}}"
},
{
"description": "Empty Vector FLOAT32",
"valid": true,
"vector": [],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "1400000005766563746F72000200000009270000",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwA=\", \"subType\": \"09\"}}}"
},
{
"description": "Infinity Vector FLOAT32",
"valid": true,
"vector": ["-inf", 0.0, "inf"],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAID/AAAAAAAAgH8=\", \"subType\": \"09\"}}}"
},
{
"description": "FLOAT32 with padding",
"valid": false,
"vector": [127.0, 7.0],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 3
}
]
}
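
JSON has no literal for infinity, so the "Infinity Vector FLOAT32" case writes those elements as the strings `"-inf"` and `"inf"`. One way a Python test runner might handle this, sketched under the assumption that passing each entry through `float()` is acceptable, is shown below; `float32_payload` is a hypothetical helper.

```python
import base64
import struct


def float32_payload(raw_vector):
    """Build the subtype 9 payload for a FLOAT32 test vector.

    Entries may be numbers or the strings "inf" / "-inf"; float() accepts both.
    """
    values = [float(x) for x in raw_vector]
    header = bytes([0x27, 0])  # dtype FLOAT32, padding 0
    return header + struct.pack(f"<{len(values)}f", *values)


payload = float32_payload(["-inf", 0.0, "inf"])
print(base64.b64encode(payload).decode("ascii"))
```

The resulting base64 string can be compared with the `base64` value inside the case's `canonical_extjson`.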

59 changes: 59 additions & 0 deletions source/bson-binary-vector/tests/int8.json
@@ -0,0 +1,59 @@
{
"description": "Tests of Binary subtype 9, Vectors, with dtype INT8",
"test_key": "vector",
"tests": [
{
"description": "Simple Vector INT8",
"valid": true,
"vector": [127, 7],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0,
"canonical_bson": "1600000005766563746F7200040000000903007F0700",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwB/Bw==\", \"subType\": \"09\"}}}"
},
{
"description": "Empty Vector INT8",
"valid": true,
"vector": [],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0,
"canonical_bson": "1400000005766563746F72000200000009030000",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwA=\", \"subType\": \"09\"}}}"
},
{
"description": "Overflow Vector INT8",
"valid": false,
"vector": [128],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0
},
{
"description": "Underflow Vector INT8",
"valid": false,
"vector": [-129],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0
},
{
"description": "INT8 with padding",
"valid": false,
"vector": [127, 7],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 3
},
{
"description": "INT8 with float inputs",
"valid": false,
"vector": [127.77, 7.77],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0
}
]
}

53 changes: 53 additions & 0 deletions source/bson-binary-vector/tests/packed_bit.json
@@ -0,0 +1,53 @@
{
"description": "Tests of Binary subtype 9, Vectors, with dtype PACKED_BIT",
"test_key": "vector",
"tests": [
{
"description": "Simple Vector PACKED_BIT",
"valid": true,
"vector": [127, 7],
"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 0,
"canonical_bson": "1600000005766563746F7200040000000910007F0700",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAB/Bw==\", \"subType\": \"09\"}}}"
},
{
"description": "Empty Vector PACKED_BIT",
"valid": true,
"vector": [],
"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 0,
"canonical_bson": "1400000005766563746F72000200000009100000",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAA=\", \"subType\": \"09\"}}}"
},
{
"description": "PACKED_BIT with padding",
"valid": true,
"vector": [127, 7],
"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 3,
"canonical_bson": "1600000005766563746F7200040000000910037F0700",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAN/Bw==\", \"subType\": \"09\"}}}"
},
{
"description": "Overflow Vector PACKED_BIT",
"valid": false,
"vector": [256],
"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 0
},
{
"description": "Underflow Vector PACKED_BIT",
"valid": false,
"vector": [-1],
"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 0
}
]
}
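
Taken together, the `"valid": false` cases in the three test files suggest a set of input checks. The sketch below is one possible reading of those cases, not a statement of the specification's validation rules; `is_valid_input` is a hypothetical name, and the padding range of 0 to 7 is taken from the binary structure table.

```python
INT8, FLOAT32, PACKED_BIT = 0x03, 0x27, 0x10


def is_valid_input(vector, dtype, padding=0):
    """Return True if (vector, dtype, padding) passes the checks implied by the tests."""
    if not 0 <= padding <= 7:
        return False
    # In these tests, non-zero padding is only accepted for PACKED_BIT.
    if dtype != PACKED_BIT and padding != 0:
        return False
    if dtype == INT8:
        return all(isinstance(x, int) and -128 <= x <= 127 for x in vector)
    if dtype == PACKED_BIT:
        return all(isinstance(x, int) and 0 <= x <= 255 for x in vector)
    if dtype == FLOAT32:
        return all(isinstance(x, (int, float)) for x in vector)
    return False  # unknown dtype


print(is_valid_input([127, 7], PACKED_BIT, padding=3))
print(is_valid_input([127, 7], INT8, padding=3))
```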

30 changes: 30 additions & 0 deletions source/bson-corpus/tests/binary.json
@@ -74,6 +74,36 @@
"description": "$type query operator (conflicts with legacy $binary form with $type field)",
"canonical_bson": "180000000378001000000010247479706500020000000000",
"canonical_extjson": "{\"x\" : { \"$type\" : {\"$numberInt\": \"2\"}}}"
},
{
"description": "subtype 0x09 Vector FLOAT32",
"canonical_bson": "170000000578000A0000000927000000FE420000E04000",
"canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"JwAAAP5CAADgQA==\", \"subType\": \"09\"}}}"
},
{
"description": "subtype 0x09 Vector INT8",
"canonical_bson": "11000000057800040000000903007F0700",
"canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"AwB/Bw==\", \"subType\": \"09\"}}}"
},
{
"description": "subtype 0x09 Vector PACKED_BIT",
"canonical_bson": "11000000057800040000000910007F0700",
"canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"EAB/Bw==\", \"subType\": \"09\"}}}"
},
{
"description": "subtype 0x09 Vector (Zero-length) FLOAT32",
"canonical_bson": "0F0000000578000200000009270000",
"canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"JwA=\", \"subType\": \"09\"}}}"
},
{
"description": "subtype 0x09 Vector (Zero-length) INT8",
"canonical_bson": "0F0000000578000200000009030000",
"canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"AwA=\", \"subType\": \"09\"}}}"
},
{
"description": "subtype 0x09 Vector (Zero-length) PACKED_BIT",
"canonical_bson": "0F0000000578000200000009100000",
"canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"EAA=\", \"subType\": \"09\"}}}"
}
],
"decodeErrors": [
1 change: 1 addition & 0 deletions source/index.md
@@ -1,6 +1,7 @@
# MongoDB Specifications

- [BSON Binary Subtype 6](client-side-encryption/subtype6.md)
- [BSON Binary Subtype 9 - Vector](bson-binary-vector/bson-binary-vector.md)
- [BSON Corpus](bson-corpus/bson-corpus.md)
- [BSON Decimal128 Type Handling in Drivers](bson-decimal128/decimal128.md)
- [Causal Consistency Specification](causal-consistency/causal-consistency.md)