-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRIVERS-2926 BSON Binary Vector Subtype Support #1658
DRIVERS-2926 BSON Binary Vector Subtype Support #1658
Conversation
Prose descriptions of tests that we suggest be added to each driver coming soon. |
Rough draft of prose tests added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would benefit from a more precise set of test cases, and all the better if they are represented in YAML format like other unit tests in the specifications repository. I know that a set of tests have been implemented in pymongo, but they should all be specified here (i.e. an implementer shouldn't have to go spelunking in the pymongo source code to find all the necessary test cases).
There is no existing test format that does what we need, but I imagine a new set of tests could look something like this:
{
"description: <some description>,
"dtype" : "int8",
"array" : [{ "$numberInt" : 42}, ...]
"binaryAsBase64": "JwAAAP5CAADgQA==\"
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to Jeff's suggestion for spec tests.
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and | ||
"OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt). | ||
|
||
## Specification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This spec would be benefited by including a diagram or model of the format itself. For example the decimal128 spec defines "structs" to represent the data format.
It would also be great to include code examples in Python that show the expected use cases (python list -> bson vector and vice versa).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to oblige, but I'm unsure what you have in mind for the model.
As to code examples, it was suggested to me not to include python code. There's now a link in the reference implementation. Sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can either add an image, e.g.
<img src="client-session-transaction-states.png" |
mongodb://username:password@example.com:27017,example2.com:27017,...,example.comN:27017/database?key=value&keyN=valueN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is over the top. How about an example?
For example, a vector [6, 7] of dtype PACKED_BIT (\x10) with a padding of 3 would like like this:
b"\x10\x03\x06\x07'
: 1 byte for dtype, 1 for padding, and 1 for each uint8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not write it as a markdown table and then explain the table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ShaneHarvey @blink1073 Please take a look now. I've added an html table.
## Reference Implementation | ||
|
||
Please consult the Python driver's `pymongo.binary` module. Prose tests described below can be found in | ||
`test.test_bson.TestBSON.test_vector`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a link to the PYTHON ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see an example of a ticket, but I did add a link to the github module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just gave a first pass to the test runner README.
- `tests`: array of test case objects, each of which have the following keys. Valid cases will also contain additional | ||
binary and json encoding values. | ||
|
||
#### Keys of tests objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should describe all the assertions that the test runner should make. Many are obvious, but one that might be missed, for example, is that you can round trip from vector to bson and then back to vector.
IIRC the bson_corpus README does this pretty well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is outdated now. If you need more, would you please be specific? I am also happy to revisit once another driver takes a stab at it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this in person, so I think we're in alignment now. The idea is to just look at all the assertions being done in the Python driver test runner, and write them down in prose here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2d4ea72 - Added Required Tests section to README. Removed JSON from tests.
|
||
## Specification | ||
|
||
This specification introduces a new BSON binary subtype, the vector, with value `"\x09"`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still needs updating
|
||
### Binary structure | ||
|
||
Following the binary subtype `\x09` a two-element byte array of metadata precedes the packed numbers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"9"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
## Reference Implementation | ||
|
||
- PYTHON (PYTHON-4577) [pymongo.binary](https://github.com/mongodb/mongo-python-driver/blob/master/bson/binary.py) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file path might change, let's just use the full JIRA link to the ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
…mply references JIRA ticket.
|
||
Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while | ||
following idioms of the language of the driver. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the server currently does not validate the metadata of vectors when they are inserted into the database. As a result, users can insert vectors with invalid metadata if they construct binary manually, such as setting the padding to -1 or applying padding to FLOAT32.
We do have tests in the specification that enforce validation in drivers' API to ensure that vectors with corrupted metadata never reach the server. However, the specification does not require drivers to validate vector metadata when the data is being read back from the database. This means that if the metadata becomes corrupted (e.g., a FLOAT32 vector with padding greater than 0), drivers could return a Vector object/structure/class from their API's to the user that contains invalid data and violates expected invariants.
Should we mention this in the spec API guidance and enforce validation when reading vectors from the database, ensuring that drivers check for invariant violations in the binary data? This would prevent returning invalid Vector structures to users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vbabanin This is a good idea. Would you like to propose some prose for this? How about the following?
Implementing drivers MUST validate metadata. Padding must be 0 for all dtypes where padding does not apply. When padding is applicable, it must be in [0, 7]. If implementing INT4
valid values would only be {0, 4}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good! I’d suggest making it just a bit more explicit to ensure clarity on where the validation should happen. How about something like this:
Drivers MUST validate vector metadata and raise an error if any invariant is violated:
- Padding MUST be 0 for all dtypes where padding doesn’t apply, and MUST be within [0, 7] for PACKED_BIT.
- A PACKED_BIT vector MUST NOT be empty if padding is in the range [1, 7].
Drivers MUST perform this validation when a numeric vector and padding are provided through the API, and when unpacking binary data (BSON or similar) into a Vector structure.
I didn’t include INT4
in validation requirements as it could be added later once it’s supported in the table of types.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validation section was added in this commit: 0ccc39944
those provided in the JSON. | ||
- encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the | ||
canonical_bson string. | ||
- For floating point number types, numerical values need not match exactly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When converting a float32 which is already stored as an approximation, like 127.699..... for 127.7, in Java for example, to a 4-byte array and back, no additional approximation error should occur since it's a direct encoding and decoding of the binary representation. So precise matching in tests seems important for drivers that support float32 to catch any mistakes in the float to bytes & bytes to float
algorithm itself. Should we consider mentioning this in the test guidance for drivers that natively support float32?
Should we also specify a small margin of error for comparing floating-point values for drivers supporting only float64 where precision loss could happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this?
"For drivers that natively support types like float32, they SHOULD implement additional testing. For example, precise matching for drivers that support float32 to catch any mistakes in the float to bytes & bytes to float algorithm itself."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I think drivers that support float32
natively have to ensure precise conversion. How about extending it with the following wording?
Drivers that natively support the floating-point type being tested (e.g., when testing float32 vector values in a driver that natively supports float32), MUST assert that the input float array is the same after encoding and decoding.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it. I made an update. Please take a look.
|
||
Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while | ||
following idioms of the language of the driver. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good! I’d suggest making it just a bit more explicit to ensure clarity on where the validation should happen. How about something like this:
Drivers MUST validate vector metadata and raise an error if any invariant is violated:
- Padding MUST be 0 for all dtypes where padding doesn’t apply, and MUST be within [0, 7] for PACKED_BIT.
- A PACKED_BIT vector MUST NOT be empty if padding is in the range [1, 7].
Drivers MUST perform this validation when a numeric vector and padding are provided through the API, and when unpacking binary data (BSON or similar) into a Vector structure.
I didn’t include INT4
in validation requirements as it could be added later once it’s supported in the table of types.
What do you think?
those provided in the JSON. | ||
- encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the | ||
canonical_bson string. | ||
- For floating point number types, numerical values need not match exactly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I think drivers that support float32
natively have to ensure precise conversion. How about extending it with the following wording?
Drivers that natively support the floating-point type being tested (e.g., when testing float32 vector values in a driver that natively supports float32), MUST assert that the input float array is the same after encoding and decoding.
What do you think?
### Encoding | ||
|
||
``` | ||
Function from_vector(vector: Iterable<Number>, dtype: DtypeEnum, padding: Integer = 0) -> Binary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from_vector
currently only accepts raw vector parameters, while as_vector
returns a Vector
structure. To keep the API consistent, it would make sense for from_vector
to accept a Vector
structure directly, similar to how as_vector
returns one. This would also reduce the need for manually extracting individual components when encoding a Vector.
Should from_vector
also accept a Vector
type as input? Perhaps this could be offered as an alternative method for drivers to implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vbabanin. I completely agree. That was an oversight. What about this?
The above is suggestive only. If a driver chooses to implement a Vector type (or numerous)
they MAY decide that from_vector that has a single argument, a Vector.
…the floating-point type being tested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
those provided in the JSON. | ||
- encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the | ||
canonical_bson string. | ||
- For floating point number types, numerical values need not match exactly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What can actually be asserted about floats? Can we place any sort of mathematical limit on the non-exact match?
…om_vector(Vector) -> Binary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Adds support for new subtype of Binary BSON type to be used for densely-packed 1d arrays (vectors).
Corresponding changes for the PyMongo as reference implementation are here.
DRIVERS-2926