Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRIVERS-2926 BSON Binary Vector Subtype Support #1658

Merged

Conversation

caseyclements
Copy link
Contributor

@caseyclements caseyclements commented Sep 13, 2024

Adds support for new subtype of Binary BSON type to be used for densely-packed 1d arrays (vectors).

Corresponding changes for the PyMongo as reference implementation are here.

DRIVERS-2926

@caseyclements caseyclements requested a review from a team as a code owner September 13, 2024 21:23
@caseyclements caseyclements requested review from jyemin and removed request for a team September 13, 2024 21:23
@caseyclements
Copy link
Contributor Author

caseyclements commented Sep 13, 2024

Prose descriptions of tests that we suggest be added to each driver coming soon.

@caseyclements caseyclements marked this pull request as draft September 13, 2024 21:24
@caseyclements
Copy link
Contributor Author

Rough draft of prose tests added.
Guidance would be great!

Copy link
Contributor

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would benefit from a more precise set of test cases, and all the better if they are represented in YAML format like other unit tests in the specifications repository. I know that a set of tests have been implemented in pymongo, but they should all be specified here (i.e. an implementer shouldn't have to go spelunking in the pymongo source code to find all the necessary test cases).

There is no existing test format that does what we need, but I imagine a new set of tests could look something like this:

{
    "description: <some description>,
    "dtype" : "int8",
    "array" : [{ "$numberInt" : 42}, ...]
    "binaryAsBase64": "JwAAAP5CAADgQA==\"
}

Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to Jeff's suggestion for spec tests.

source/bson-corpus/binary-vector-subtype.md Outdated Show resolved Hide resolved
source/bson-corpus/binary-vector-subtype.md Outdated Show resolved Hide resolved
source/bson-corpus/binary-vector-subtype.md Outdated Show resolved Hide resolved
source/bson-corpus/binary-vector-subtype.md Outdated Show resolved Hide resolved
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

## Specification
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This spec would be benefited by including a diagram or model of the format itself. For example the decimal128 spec defines "structs" to represent the data format.

It would also be great to include code examples in Python that show the expected use cases (python list -> bson vector and vice versa).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to oblige, but I'm unsure what you have in mind for the model.

As to code examples, it was suggested to me not to include python code. There's now a link in the reference implementation. Sufficient?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can either add an image, e.g.

<img src="client-session-transaction-states.png"
, or ascii art (
mongodb://username:password@example.com:27017,example2.com:27017,...,example.comN:27017/database?key=value&keyN=valueN
). Here's an example of some implementation suggestion code: https://github.com/mongodb/specifications/blob/b5cbae0211d3d5049680cfc0803e4589474464bc/source/auth/auth.md#hostname-canonicalization

Copy link
Contributor Author

@caseyclements caseyclements Sep 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is over the top. How about an example?

For example, a vector [6, 7] of dtype PACKED_BIT (\x10) with a padding of 3 would like like this:
b"\x10\x03\x06\x07': 1 byte for dtype, 1 for padding, and 1 for each uint8.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not write it as a markdown table and then explain the table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShaneHarvey @blink1073 Please take a look now. I've added an html table.

## Reference Implementation

Please consult the Python driver's `pymongo.binary` module. Prose tests described below can be found in
`test.test_bson.TestBSON.test_vector`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a link to the PYTHON ticket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see an example of a ticket, but I did add a link to the github module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

source/bson-corpus/binary-vector-subtype.md Outdated Show resolved Hide resolved
@caseyclements caseyclements marked this pull request as ready for review September 20, 2024 14:35
Copy link
Contributor

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just gave a first pass to the test runner README.

- `tests`: array of test case objects, each of which have the following keys. Valid cases will also contain additional
binary and json encoding values.

#### Keys of tests objects
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should describe all the assertions that the test runner should make. Many are obvious, but one that might be missed, for example, is that you can round trip from vector to bson and then back to vector.

IIRC the bson_corpus README does this pretty well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is outdated now. If you need more, would you please be specific? I am also happy to revisit once another driver takes a stab at it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this in person, so I think we're in alignment now. The idea is to just look at all the assertions being done in the Python driver test runner, and write them down in prose here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2d4ea72 - Added Required Tests section to README. Removed JSON from tests.


## Specification

This specification introduces a new BSON binary subtype, the vector, with value `"\x09"`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs updating


### Binary structure

Following the binary subtype `\x09` a two-element byte array of metadata precedes the packed numbers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"9"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


## Reference Implementation

- PYTHON (PYTHON-4577) [pymongo.binary](https://github.com/mongodb/mongo-python-driver/blob/master/bson/binary.py)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file path might change, let's just use the full JIRA link to the ticket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while
following idioms of the language of the driver.

Copy link
Member

@vbabanin vbabanin Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the server currently does not validate the metadata of vectors when they are inserted into the database. As a result, users can insert vectors with invalid metadata if they construct binary manually, such as setting the padding to -1 or applying padding to FLOAT32.

We do have tests in the specification that enforce validation in drivers' API to ensure that vectors with corrupted metadata never reach the server. However, the specification does not require drivers to validate vector metadata when the data is being read back from the database. This means that if the metadata becomes corrupted (e.g., a FLOAT32 vector with padding greater than 0), drivers could return a Vector object/structure/class from their API's to the user that contains invalid data and violates expected invariants.

Should we mention this in the spec API guidance and enforce validation when reading vectors from the database, ensuring that drivers check for invariant violations in the binary data? This would prevent returning invalid Vector structures to users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vbabanin This is a good idea. Would you like to propose some prose for this? How about the following?

Implementing drivers MUST validate metadata. Padding must be 0 for all dtypes where padding does not apply. When padding is applicable, it must be in [0, 7]. If implementing INT4 valid values would only be {0, 4}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good! I’d suggest making it just a bit more explicit to ensure clarity on where the validation should happen. How about something like this:

Drivers MUST validate vector metadata and raise an error if any invariant is violated:

  • Padding MUST be 0 for all dtypes where padding doesn’t apply, and MUST be within [0, 7] for PACKED_BIT.
  • A PACKED_BIT vector MUST NOT be empty if padding is in the range [1, 7].

Drivers MUST perform this validation when a numeric vector and padding are provided through the API, and when unpacking binary data (BSON or similar) into a Vector structure.

I didn’t include INT4 in validation requirements as it could be added later once it’s supported in the table of types.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation section was added in this commit: 0ccc39944

those provided in the JSON.
- encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the
canonical_bson string.
- For floating point number types, numerical values need not match exactly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When converting a float32 which is already stored as an approximation, like 127.699..... for 127.7, in Java for example, to a 4-byte array and back, no additional approximation error should occur since it's a direct encoding and decoding of the binary representation. So precise matching in tests seems important for drivers that support float32 to catch any mistakes in the float to bytes & bytes to float algorithm itself. Should we consider mentioning this in the test guidance for drivers that natively support float32?

Should we also specify a small margin of error for comparing floating-point values for drivers supporting only float64 where precision loss could happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this?

"For drivers that natively support types like float32, they SHOULD implement additional testing. For example, precise matching for drivers that support float32 to catch any mistakes in the float to bytes & bytes to float algorithm itself."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I think drivers that support float32 natively have to ensure precise conversion. How about extending it with the following wording?

Drivers that natively support the floating-point type being tested (e.g., when testing float32 vector values in a driver that natively supports float32), MUST assert that the input float array is the same after encoding and decoding.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. I made an update. Please take a look.


Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while
following idioms of the language of the driver.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good! I’d suggest making it just a bit more explicit to ensure clarity on where the validation should happen. How about something like this:

Drivers MUST validate vector metadata and raise an error if any invariant is violated:

  • Padding MUST be 0 for all dtypes where padding doesn’t apply, and MUST be within [0, 7] for PACKED_BIT.
  • A PACKED_BIT vector MUST NOT be empty if padding is in the range [1, 7].

Drivers MUST perform this validation when a numeric vector and padding are provided through the API, and when unpacking binary data (BSON or similar) into a Vector structure.

I didn’t include INT4 in validation requirements as it could be added later once it’s supported in the table of types.

What do you think?

those provided in the JSON.
- encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the
canonical_bson string.
- For floating point number types, numerical values need not match exactly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I think drivers that support float32 natively have to ensure precise conversion. How about extending it with the following wording?

Drivers that natively support the floating-point type being tested (e.g., when testing float32 vector values in a driver that natively supports float32), MUST assert that the input float array is the same after encoding and decoding.

What do you think?

### Encoding

```
Function from_vector(vector: Iterable<Number>, dtype: DtypeEnum, padding: Integer = 0) -> Binary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_vector currently only accepts raw vector parameters, while as_vector returns a Vector structure. To keep the API consistent, it would make sense for from_vector to accept a Vector structure directly, similar to how as_vector returns one. This would also reduce the need for manually extracting individual components when encoding a Vector.

Should from_vector also accept a Vector type as input? Perhaps this could be offered as an alternative method for drivers to implement.

Copy link
Contributor Author

@caseyclements caseyclements Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vbabanin. I completely agree. That was an oversight. What about this?

The above is suggestive only. If a driver chooses to implement a Vector type (or numerous)
they MAY decide that from_vector that has a single argument, a Vector.

Copy link
Contributor

@nbbeeken nbbeeken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

those provided in the JSON.
- encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the
canonical_bson string.
- For floating point number types, numerical values need not match exactly.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What can actually be asserted about floats? Can we place any sort of mathematical limit on the non-exact match?

Copy link
Member

@blink1073 blink1073 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@caseyclements caseyclements merged commit 58de9a7 into mongodb:master Oct 23, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants