Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDRIVER-4363 add client bulk write #1590

Merged
merged 38 commits into from
May 16, 2024

Conversation

kevinAlbs
Copy link
Collaborator

@kevinAlbs kevinAlbs commented May 1, 2024

Summary

This PR implements the MongoClient.bulkWrite API proposed in: mongodb/specifications#1534

RST documentation was omitted from this PR in case the API changes during review. I plan to add RST documentation in a future PR.

Commits are separated to ease filtering the JSON test files and minor drive-by improvements. See the bulk-write.md file in mongodb/specifications#1534 for the expected behavior.

Testing

JSON files are copied from the mongodb/specifications#1534. These were reviewed in the spec PR and I do not expect need another review. test-mongoc-crud.c includes the prose tests from the specification. test-mongoc-bulkwrite.c includes libmongoc-specific tests.

Background

Terms

The following terms are used in this PR description:

  • MongoCollection.bulkWrite refers to the existing API from the CRUD specification. The C driver implements in terms of mongoc_bulk_operation_t.
  • MongoClient.bulkWrite refers to the new API in the new Bulk Write specification. This PR implements in terms of the new mongoc_bulkwrite_t handle.

MongoCollection.bulkWrite splits write operations by type, and sends separate insert, update, and delete server commands. MongoCollection.bulkWrite writes to only one collection.

MongoClient.bulkWrite uses the bulkWrite server command. The bulkWrite command supports insert, update, and delete operations for multiple namespaces. The bulkWrite command is introduced in server 8.0. MongoClient.bulkWrite can write to multiple collections.

Both bulk write APIs require splitting large writes into separate commands to stay within server size limits. E.g. a call to MongoClient.bulkWrite with many writes may result in multiple bulkWrite commands sent.

Rationale

The following describes rationale behind some decisions in this PR:

Models

The specification defines model classes for each write. Here is the DeleteOneModel definition (with comments removed for brevity):

class DeleteOneModel implements WriteModel {
    filter: Document;
    collation: Optional<Document>;
    hint: Optional<Document | String>;
}

This PR represents the required fields in arguments, and groups optional fields in an opaque opts struct:

MONGOC_EXPORT (bool)
mongoc_bulkwrite_append_deleteone (mongoc_bulkwrite_t *self,
                                   const char *ns,
                                   const bson_t *filter,
                                   mongoc_bulkwrite_deleteoneopts_t *opts /* May be NULL */,
                                   bson_error_t *error);

I thought this was a reasonable compromise between matching the spec and existing C driver patterns. If the spec later extends DeleteOneModel, mongoc_bulkwrite_deleteoneopts_t can be extended.

The existing C driver implementation of MongoCollection.bulkWrite uses bson_t for options:

MONGOC_EXPORT (bool)
mongoc_bulk_operation_remove_one_with_opts (mongoc_bulk_operation_t *bulk,
                                            const bson_t *selector,
                                            const bson_t *opts,
                                            bson_error_t *error); /* OUT */

However, I think a typed mongoc_bulkwrite_deleteoneopts_t is preferable to bson_t. It may ease development with compile time type checks and auto complete.

Naming

This PR uses a slightly different API naming pattern. Parts are logically separated by an underscore. Example: mongoc_bulkwriteopts_set_bypassdocumentvalidation.
This differs slightly from the established pattern of separating each word. Example in current API: mongoc_bulk_operation_set_bypass_document_validation.
I find the logical separation easier to read.

Validation

The existing C driver implementation of MongoCollection.bulkWrite includes default client-side validation:

validate: Construct a bitwise-or of all desired bson_validate_flags_t. Set to false to skip client-side validation of the provided BSON documents.

The default flags for writes are all the same:

const bson_validate_flags_t _mongoc_default_insert_vflags =
   BSON_VALIDATE_UTF8 | BSON_VALIDATE_UTF8_ALLOW_NULL |
   BSON_VALIDATE_EMPTY_KEYS;

UTF-8 validation appears broken. Passing invalid UTF-8 for string values does not result in error. See: CDRIVER-4448. The default validation only enables BSON_VALIDATE_EMPTY_KEYS to check there are no empty keys (e.g. {"": "foo"}).

CDRIVER-3731 is a request to configure validation at a broader scope. Quoting jeroen/mongolite#206:

It probably makes sense to disable this globally

The C driver benchmarks disable the default validation. I re-enabled the validation and reran the benchmark. This resulted in significant performance drops in the base comparison:

Performance regression with validation enabled

Since UTF-8 validation is already broken, there are requests to disable, validation negatively impacts performance, and it is not a spec requirement, this PR excludes default validation in MongoClient.bulkWrite.

This is a drive-by fix to fix the referenced payload numbers.
Otherwise, if the type mismatches, the test runner aborts with an error: "expected int64, int32, or double [...]" that does not identify the field.
To do exact BSON matches
To ease reading error message
include some tests of basic usage and libmongoc-specific behavior. Specification and prose tests test driver-agnostic behavior.
@kevinAlbs kevinAlbs marked this pull request as ready for review May 1, 2024 20:03
Comment on lines 572 to 573
if (write_errors) {
if (write_errors && !result->write_errors) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the write_errors in the nested if is redundant.

bson_oid_init (&oid, NULL);
BSON_ASSERT (BSON_APPEND_OID (&tmp, "_id", &oid));
BSON_ASSERT (bson_concat (&tmp, document));
BSON_ASSERT (bson_append_document (&op, "document", 8, &tmp));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than hardcoding a string length, I find it more maintainable to use something like strlen("document") instead. The compiler should optimize away the call to strlen. See this example on godbolt https://godbolt.org/z/bhzoddYKo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PHP internals often does sizeof(<str literal>) - 1, but strlen() is certainly more concise if it gets compiled the same.

There also used to be a PHP macro for expanding a string literal into two arguments (the string and length), but it had a cryptic name and casual readers might not realize it expands and assume a parameter is missing. If that sounds useful, readability could probably be addressed with a more verbose name (e.g. STR_AND_LEN()). The main benefit is avoiding duplication (be it the original string or its length).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler should optimize away the call to strlen.

Neat, thank you for the example. Updated to use the BSON_APPEND_* macros to remove the hard coded lengths. The macros add a strlen call.

}

mongoc_bulkwritereturn_t
mongoc_bulkwrite_execute (mongoc_bulkwrite_t *self, const mongoc_bulkwriteopts_t *opts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is nearly 600 lines long. Is it possible to break this function up into smaller functions? The general rule of thumb that I use is to make functions that fit in a single screen size without needing to scroll (approximately 80 lines).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to split into two new helpers: _bulkwritereturn_apply_result and _bulkwritereturn_apply_reply.

An attempt to more aggressively split did not improve readability IMO. I think separate functions are helpful when it is "easy" to reason about what state is changed. Result handling seemed like an improvement, since only the mongoc_bulkwritereturn_t is updated.

Though this function is still long, block scopes with comments were used to logically group. Code folding can collapse blocks when reading:

code folding

tf->maxMessageSizeBytes = sl.maxMessageSizeBytes;
tf->maxBsonObjectSize = sl.maxBsonObjectSize;

const int32_t opsBytes = tf->maxMessageSizeBytes - 1122;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please comment the meaning of 1122 or create a constant with a meaningful name and assign this value to it. Please do the same for other locations where magic numbers are used, e.g., 57, 217, etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values are copied from the prose test. The prose test describes the calculation.

Added a comment referring to the prose tests.

src/libmongoc/tests/unified/runner.c Outdated Show resolved Hide resolved
src/libmongoc/tests/TestSuite.c Show resolved Hide resolved
src/libmongoc/tests/TestSuite.h Show resolved Hide resolved
src/libmongoc/src/mongoc/mongoc-write-command.c Outdated Show resolved Hide resolved
src/libmongoc/tests/bsonutil/bson-val.c Show resolved Hide resolved
src/libmongoc/src/mongoc/mongoc-bulkwrite.c Show resolved Hide resolved
src/libmongoc/src/mongoc/mongoc-bulkwrite.c Outdated Show resolved Hide resolved
src/libmongoc/src/mongoc/mongoc-bulkwrite.c Outdated Show resolved Hide resolved
src/libmongoc/src/mongoc/mongoc-bulkwrite.c Outdated Show resolved Hide resolved
src/libmongoc/src/mongoc/mongoc-bulkwrite.c Outdated Show resolved Hide resolved
@kevinAlbs kevinAlbs removed the request for review from vector-of-bool May 10, 2024 20:32
Array filters do not apply to replace. Remove to match spec.
The return was not being set. Callers can determine if the write was acknowledged by checking if `mongoc_bulkwritereturn_t::res` is NULL.
Copy link
Collaborator Author

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest changes verified with this patch build.

src/libmongoc/src/mongoc/mongoc-bulkwrite.c Outdated Show resolved Hide resolved
src/libmongoc/src/mongoc/mongoc-bulkwrite.c Show resolved Hide resolved
src/libmongoc/src/mongoc/mongoc-bulkwrite.c Outdated Show resolved Hide resolved
src/libmongoc/src/mongoc/mongoc-bulkwrite.c Show resolved Hide resolved
bson_oid_init (&oid, NULL);
BSON_ASSERT (BSON_APPEND_OID (&tmp, "_id", &oid));
BSON_ASSERT (bson_concat (&tmp, document));
BSON_ASSERT (bson_append_document (&op, "document", 8, &tmp));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler should optimize away the call to strlen.

Neat, thank you for the example. Updated to use the BSON_APPEND_* macros to remove the hard coded lengths. The macros add a strlen call.

src/libmongoc/src/mongoc/mongoc-write-command.c Outdated Show resolved Hide resolved
src/libmongoc/tests/TestSuite.c Show resolved Hide resolved
src/libmongoc/tests/bsonutil/bson-val.c Show resolved Hide resolved
tf->maxMessageSizeBytes = sl.maxMessageSizeBytes;
tf->maxBsonObjectSize = sl.maxBsonObjectSize;

const int32_t opsBytes = tf->maxMessageSizeBytes - 1122;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values are copied from the prose test. The prose test describes the calculation.

Added a comment referring to the prose tests.

src/libmongoc/tests/unified/runner.c Outdated Show resolved Hide resolved
Copy link
Member

@jmikola jmikola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed requested changes from my previous review. Note that I did not diligently examine the test changes (just the public API we'd end up using in PHPC).

Copy link
Contributor

@kkloberdanz kkloberdanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

mongoc_bulkwriteopts_set_let (mongoc_bulkwriteopts_t *self, const bson_t *let)
{
BSON_ASSERT_PARAM (self);
BSON_ASSERT (let || true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not introduce more instances of this funny line. I think we need something like:

#define BSON_OPTIONAL_PARAM (x) (void)(x)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added BSON_OPTIONAL_PARAM.

@kevinAlbs kevinAlbs merged commit 84728bd into mongodb:master May 16, 2024
41 of 45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants