PHPLIB-1182: Support codec option in operation classes #1140

Merged: 37 commits, Aug 25, 2023

@alcaeus alcaeus commented Jul 24, 2023

PHPLIB-1182

This PR adds support for a codec option to the MongoDB\Collection class as well as the following operations:

  • aggregate
  • bulkWrite (for insertOne and replaceOne operations)
  • find
  • findOne
  • findOneAndDelete (returned documents only)
  • findOneAndUpdate (returned documents only)
  • findOneAndReplace (replacement document and returned documents)
  • insertMany
  • insertOne
  • replaceOne
  • watch

When a codec is specified, the typeMap option is ignored entirely, and a bson type is applied to the root element. The Collection class supports a collection-level codec option, which is applied whenever one of the above operations is called without its own codec option. Passing null as the codec option of an operation disables the collection-level codec for that operation, so that type maps are used to deserialise data instead.
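The inheritance rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: `resolveCodec()` is a hypothetical helper rather than a library function, and `stdClass` instances stand in for real codec objects. The point it shows is that an explicit `null` must be distinguished from an absent option.

```php
<?php

// Hypothetical helper, not part of the library: decides which codec an
// operation should use, following the rules described in the PR text.
function resolveCodec(array $collectionOptions, array $operationOptions): ?object
{
    // An explicit 'codec' key on the operation always wins, even when it is
    // null: null disables the collection-level codec so type maps apply.
    if (array_key_exists('codec', $operationOptions)) {
        return $operationOptions['codec'];
    }

    // Otherwise fall back to the collection-level codec, if there is one.
    return $collectionOptions['codec'] ?? null;
}

$collectionCodec = new stdClass(); // stand-in for a collection-level codec
$operationCodec = new stdClass();  // stand-in for an operation-level codec

$inherited = resolveCodec(['codec' => $collectionCodec], []);
$overridden = resolveCodec(['codec' => $collectionCodec], ['codec' => $operationCodec]);
$disabled = resolveCodec(['codec' => $collectionCodec], ['codec' => null]);
```

Using `array_key_exists()` rather than `isset()` is the design choice that matters here, since `isset()` would treat an explicit `null` the same as a missing key.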

@alcaeus alcaeus requested review from jmikola and GromNaN July 24, 2023 08:58
@alcaeus alcaeus self-assigned this Jul 24, 2023
@@ -188,7 +203,7 @@ public function __toString()
* @see Aggregate::__construct() for supported options
* @param array $pipeline Aggregation pipeline
* @param array $options Command options
- * @return Cursor
+ * @return CursorInterface&Iterator
Member Author

While technically a BC break, this is necessary for us to allow returning a custom iterator class. The Cursor class implements both interfaces, so the public API does not change. If users explicitly check the returned class type, this may cause unintended behaviour.
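For callers, the practical consequence is to type-hint against the interfaces rather than the concrete class. The following is a minimal sketch, assuming PHP 8.1 or later for intersection types; `CursorLike` and `DecodingCursor` are hypothetical stand-ins and not library classes.

```php
<?php

// Hypothetical stand-in for a cursor interface, reduced to one method.
interface CursorLike
{
    public function toArray(): array;
}

// Hypothetical custom iterator, standing in for a codec-aware cursor that
// is not an instance of the concrete Cursor class.
final class DecodingCursor implements CursorLike, Iterator
{
    private int $position = 0;

    public function __construct(private array $items)
    {
    }

    public function toArray(): array { return $this->items; }
    public function current(): mixed { return $this->items[$this->position]; }
    public function key(): mixed { return $this->position; }
    public function next(): void { $this->position++; }
    public function rewind(): void { $this->position = 0; }
    public function valid(): bool { return isset($this->items[$this->position]); }
}

// Accepts any object implementing both interfaces, whatever its class.
function countResults(CursorLike&Iterator $cursor): int
{
    $count = 0;
    foreach ($cursor as $unused) {
        $count++;
    }

    return $count;
}

$count = countResults(new DecodingCursor(['a', 'b', 'c']));
```

Code written this way keeps working when a different iterator class is returned, whereas an `instanceof` check against one concrete class would not.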

$resumeToken = is_array($document)
? ($document['_id'] ?? null)
: ($document->_id ?? null);
if ($document instanceof Document) {
Member Author

This is to support watch operations with a codec supplied, so we can avoid messing with the type map on every iteration.

Comment on lines +254 to +265
if ($resumeToken instanceof Document) {
$resumeToken = $resumeToken->toPHP();
}
Member Author

Passing a Document instance as resume token would be valid, but not converting it to PHP causes test failures. In our tests, we extract the resume token from the command response, which is always in PHP format. To avoid issues when testing watch operations with a codec, this is always converted to a PHP value here as well.

@alcaeus alcaeus force-pushed the phplib-1182-codec-operations branch from 9925bbd to 83b38db Compare August 2, 2023 07:38
Member Author

alcaeus commented Aug 2, 2023

@jmikola I've changed the handling in operations that write (BulkWrite, FindOneAndReplace, InsertOne, InsertMany, and ReplaceOne):

  • Encoding is now always done through encode instead of encodeIfSupported - when a codec is involved we're expecting that codec to be able to handle incoming values.
  • Encoding is now done before all other checks, despite having knowledge that encode will always return a Document (so checks like is_update_pipeline will return false). For replace operations the check for atomic update operators is still relevant, so I decided it was easiest to encode first, check later.
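The "encode first, check later" ordering might look roughly like the sketch below. All names are assumptions made for illustration: `PersonCodec`, `isFirstKeyOperator()` and `prepareReplacement()` are hypothetical, and the operator check is simplified to looking at the first key only.

```php
<?php

// Hypothetical stand-in for a codec: turns a domain object into a document.
final class PersonCodec
{
    public function encode(object $value): array
    {
        return ['_id' => $value->id, 'name' => $value->name];
    }
}

// Simplified check for atomic update operators: first key starts with "$".
function isFirstKeyOperator(array $document): bool
{
    $firstKey = (string) array_key_first($document);

    return $firstKey !== '' && $firstKey[0] === '$';
}

function prepareReplacement(array|object $replacement, ?PersonCodec $codec): array
{
    // Encode first, so the check below runs on what would actually be sent.
    if ($codec !== null) {
        $replacement = $codec->encode($replacement);
    }

    $replacement = (array) $replacement;

    // Check later: a replacement document must not start with an operator.
    if (isFirstKeyOperator($replacement)) {
        throw new InvalidArgumentException('First key in replacement is an update operator');
    }

    return $replacement;
}

$person = new stdClass();
$person->id = 1;
$person->name = 'Ada';
$encoded = prepareReplacement($person, new PersonCodec());

$rejected = false;
try {
    prepareReplacement(['$set' => ['name' => 'x']], null);
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```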

The first change leaves open how we want to proceed with decoding values: we may want to use decode instead of decodeIfSupported - if nothing else the user then has a guarantee that the return values are of the type they expect, instead of having to account for the possibility of the codec not being able to decode a given value and returning it as-is. If you agree I can make that change as well.


$operation = new Count($this->databaseName, $this->collectionName, $filter, $options);

-return $operation->execute($server);
+return $operation->execute(select_server($this->manager, $options));
Member

This is a subtle behavioral change because we're now deferring server selection until after constructing the Count operation, but I'll note that other helper methods vary in this regard.

For example, helpers like aggregate() must select servers in advance to handle option inheritance.

I think it does make sense to defer server selection as long as possible, since it makes it easier to catch other client-side validation errors sooner.
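The benefit of deferring server selection can be illustrated with a small sketch. `CountSketch` and the `$selectServer` closure are hypothetical stand-ins for the Count operation and `select_server()`; the closure only counts how often selection is attempted.

```php
<?php

// Hypothetical operation whose constructor validates its options.
final class CountSketch
{
    public function __construct(array $options)
    {
        if (isset($options['limit']) && ! is_int($options['limit'])) {
            throw new InvalidArgumentException('"limit" must be an integer');
        }
    }
}

$serverSelections = 0;

// Stand-in for server selection: records each attempt.
$selectServer = function () use (&$serverSelections): string {
    $serverSelections++;

    return 'server';
};

$validationFailed = false;
try {
    $operation = new CountSketch(['limit' => 'ten']); // validated first
    $server = $selectServer();                         // never reached
} catch (InvalidArgumentException $e) {
    $validationFailed = true;
}
```

With this ordering, an invalid option is reported without ever contacting the deployment.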

Member

@jmikola jmikola left a comment

> The first change leaves open how we want to proceed with decoding values: we may want to use decode instead of decodeIfSupported - if nothing else the user then has a guarantee that the return values are of the type they expect, instead of having to account for the possibility of the codec not being able to decode a given value and returning it as-is. If you agree I can make that change as well.

My initial reaction is that strict enforcement with decode() is preferable, but that may run afoul of the Robustness Principle. Relying on encode() ensures that we emit well-formed values instead of leaving garbage as-is (although it'd probably fail later validation if important).

For decoding, the risk seems to be that the server returns something the codec didn't expect. Today, users would just get those values as-is and a typeMap with field paths that don't exist might just NOP. If that sounds like behavior we should preserve then I'd be on board with decodeIfSupported().

But to argue once more in the other direction: if the codec we expect to be used is something quite generic like DocumentCodec, then I don't see much harm in using decode().

Sorry for bouncing back and forth here, but I'm struggling to conceive of all of the ways things could possibly go wrong with either approach. Maybe we should punt this to a video chat as well.

@alcaeus alcaeus force-pushed the phplib-1182-codec-operations branch from 72077ff to c3a3164 Compare August 4, 2023 07:35
Member

jmikola commented Aug 4, 2023

Discussed this over video, and we concluded that decode() should be used instead of decodeIfSupported(). This will ensure that something like a Cursor doesn't end up with mixed decoded values and Document instances. This poses no problems for the default DocumentCodec implementation, and other implementations shouldn't be determining codec support based on things like fields existing or not. So an exception for an unsupported type would only happen if the server response is completely unexpected, as we historically have done in operations like FindAndModify.
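The difference between the two decoding strategies can be sketched as follows. `UpperCaseCodec` is a hypothetical class that only borrows the method names discussed above; it does not implement the library's interfaces, and the exception type is an assumption.

```php
<?php

// Hypothetical codec using the canDecode / decode / decodeIfSupported naming.
final class UpperCaseCodec
{
    public function canDecode(mixed $value): bool
    {
        return is_string($value);
    }

    public function decode(mixed $value): string
    {
        if (! $this->canDecode($value)) {
            throw new UnexpectedValueException('Value is not supported by this codec');
        }

        return strtoupper($value);
    }

    public function decodeIfSupported(mixed $value): mixed
    {
        return $this->canDecode($value) ? $this->decode($value) : $value;
    }
}

$codec = new UpperCaseCodec();
$results = ['a', 42, 'b'];

// Lenient decoding silently yields a mix of decoded and raw values.
$lenient = array_map([$codec, 'decodeIfSupported'], $results);

// Strict decoding fails loudly on the first unsupported value.
$strictFailed = false;
try {
    array_map([$codec, 'decode'], $results);
} catch (UnexpectedValueException $e) {
    $strictFailed = true;
}
```

The lenient result is the "mixed decoded values" situation the comment above wants to rule out for cursors.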

@alcaeus alcaeus force-pushed the phplib-1182-codec-operations branch from 8f634dd to 7a4b42c Compare August 4, 2023 15:54
);
}

public function provideFindOneAndModifyOptions(): Generator
Member

Consider adding a regression test for the issue I mentioned in https://github.com/mongodb/mongo-php-library/pull/1140/files#r1284677901. You'd need some findAndModify options to result in value being unset on the response.

See findAndModify: value for two possible conditions to trigger that response.

Member Author

Added tests that use the "return null if nothing was matched" logic to ensure we're returning the correct values. Thank you for highlighting this once again as I missed this the first time around.

Member

@jmikola jmikola left a comment

Note the two phpcs errors.

@alcaeus alcaeus force-pushed the phplib-1182-codec-operations branch from 7a4b42c to aed2665 Compare August 16, 2023 11:08
@alcaeus alcaeus requested a review from jmikola August 16, 2023 11:09
@alcaeus alcaeus force-pushed the phplib-1182-codec-operations branch from aed2665 to 325165a Compare August 16, 2023 11:49
@@ -301,7 +302,7 @@ public function drop()
* @see Find::__construct() for supported options
* @param array|object $filter Query by which to filter documents
* @param array $options Additional options
- * @return Cursor
+ * @return CursorInterface&Iterator
Member

The Collection instance is created internally, so I don't see how someone could pass on a codec to get a different result. The return type will always be Cursor.

Member Author

The $options passed to Bucket::find are passed down to Collection::find as-is, so a user could introduce a codec that way, which would result in a return value of a type other than Cursor.

That said, this makes me wonder whether we should directly support a codec option in the Bucket class as well (similar to how typeMap is already supported). I haven't considered that until now, so if we decide to go that route I'd do so in a separate ticket and pull request.

Member

> The $options passed to Bucket::find are passed down to Collection::find as-is, so a user could introduce a codec that way, which would result in a return value of a type other than Cursor.

Noted that this explains the CursorInterface&Iterator return type being added here.

> That said, this makes me wonder whether we should directly support a codec option in the Bucket class as well (similar to how typeMap is already supported).

The typeMap option is only used for getFileDocumentForStream() and getFileIdForStream(), and uploadFromStream() indirectly since that calls getFileIdForStream().

I think it does make sense to support codec there as we do typeMap today. Please create a separate ticket for that.

Member Author

Created PHPLIB-1220.

@@ -47,7 +49,7 @@
*
* @internal
* @template TValue of array|object
- * @template-extends IteratorIterator<int, TValue, Cursor<TValue>>
+ * @template-extends IteratorIterator<int, TValue, Iterator<int, TValue>>
Member

After checking, Psalm seems to accept intersection types here.

Suggested change
- * @template-extends IteratorIterator<int, TValue, Iterator<int, TValue>>
+ * @template-extends IteratorIterator<int, TValue, CursorInterface&Iterator<int, TValue>>

This commit also refactors the logic of inheriting collection-level options to operations to reduce code duplication.
This commit also changes the behaviour to use encode() instead of encodeIfSupported(), requiring documents to be encodable by the given codec in order to be inserted/updated.
@alcaeus alcaeus force-pushed the phplib-1182-codec-operations branch from 325165a to 0ed7f67 Compare August 22, 2023 12:50
Member

@jmikola jmikola left a comment

Suggested exception and new test for Watch.


public function testFindOneAndDeleteNothingWithCodec(): void
{
// When the query does not match any documents, the operation returns null
Member

Thank you for adding these tests, per the recent change to FindAndModify to directly return null instead of calling decode().
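The "return null instead of calling decode()" behaviour can be sketched like this. `extractValue()` is a hypothetical helper, the response is modelled as a plain array, and the decoder is a simple callable rather than a real codec.

```php
<?php

// Hypothetical helper: pulls the document out of a findAndModify-style
// response and only decodes it when a document was actually returned.
function extractValue(array $response, ?callable $decode): mixed
{
    $value = $response['value'] ?? null;

    // Nothing matched: return null directly and never hand it to the codec.
    if ($value === null) {
        return null;
    }

    return $decode !== null ? $decode($value) : $value;
}

// Stand-in decoder: converts the returned array into an object.
$decode = fn (array $document): object => (object) $document;

$missing = extractValue(['ok' => 1], $decode);
$found = extractValue(['ok' => 1, 'value' => ['_id' => 1]], $decode);
```

Skipping the decoder for a missing value matters once decoding is strict, because a strict decoder would otherwise throw on `null`.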

@@ -199,6 +206,10 @@ public function __construct(Manager $manager, ?string $databaseName, ?string $co
'readPreference' => new ReadPreference(ReadPreference::PRIMARY),
];

if (isset($options['codec']) && ! $options['codec'] instanceof DocumentCodec) {
Member

In a previously resolved thread (https://github.com/mongodb/mongo-php-library/pull/1140/files#r1272549069), we talked about why codec is not passed down to Aggregate while typeMap is.

Looking at this once more alongside the changes in ChangeStream to apply the codec upon construction and when resuming, I think there's an edge case where typeMap and codec are both specified as a Watch option. If that happens, I imagine the typeMap will just get ignored when the ChangeStream is constructed and modifies its inner iterator (i.e. the cursor).

We prohibit that scenario for other operations, such as Find:

if (isset($options['codec']) && isset($options['typeMap'])) {
    throw InvalidArgumentException::cannotCombineCodecAndTypeMap();
}

Should we do so later in this constructor and add a test in WatchTest?

Member Author

Good point. I tested to confirm it works as expected with codec AND typeMap specified. I agree that specifying both options should be prohibited as we do in other operations. Added the exception and a test as suggested.

/**
* @internal
*
* @param ResumeCallable $resumeCallable
*/
-public function __construct(ChangeStreamIterator $iterator, callable $resumeCallable)
+public function __construct(ChangeStreamIterator $iterator, callable $resumeCallable, ?DocumentCodec $codec = null)
Member

Watch constructs the ChangeStreamIterator and has access to its internal cursor, even though it currently sends it as-is directly from the private executeAggregate() method. Watch also constructs the $resumeCallable here.

I was going to ask why ChangeStream bothers with calling setTypeMap(['root' => 'bson']) when that could otherwise be done in Watch; however, even if we did move that logic to Watch, ChangeStream would still need to know about the codec in order to process return values for current().

And you previously explained that Watch cannot apply the codec itself since doing so might interfere with resume token collection.

I think this is fine as-is but just want to confirm with you before signing off. Feel free to resolve after reading if my understanding is correct.

Another reason for asking was that ChangeStream is in the public API (although undocumented), so I'm paying extra attention to the signature change above.

Member Author

Since ChangeStream needs to be in control of the codec used in the cursor (as it needs to be sure about the return types), I figured it would make more sense for ChangeStream to apply its codec to the underlying cursor instead of relying on that being set correctly. As is, Watch is only accepting the codec option on behalf of ChangeStream, and passing it on without needing to know about internal workings of ChangeStream.

The main problem with applying the codec in the original Aggregate operation is that while the aggregation pipeline results will always have an _id field to be used as resume token, we cannot make such guarantees after the codec has been applied. After that, the _id field may have been discarded or assigned to an object property we don't know about.

Also note that while ChangeStream and ChangeStreamIterator are part of the public API, both their constructors are marked as internal, indicating that while we expect users to interact with these classes, we don't expect them to instantiate them on their own. The signature change here is backward compatible, as the $codec parameter defaults to null, so any previously working instantiations of that class will continue to function as expected.
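The resume-token concern raised in this reply can be made concrete with a short sketch. `processEvent()` is a hypothetical function and events are modelled as plain arrays. The idea is to read `_id` from the raw event before any decoding happens, because the decoded result may no longer expose it.

```php
<?php

// Hypothetical change-stream step: read the resume token from the raw
// event first, then hand the event to the decoder.
function processEvent(array $rawEvent, callable $decode): array
{
    if (! isset($rawEvent['_id'])) {
        throw new UnexpectedValueException('Resume token (_id) not found in change document');
    }

    $resumeToken = $rawEvent['_id'];

    return ['resumeToken' => $resumeToken, 'event' => $decode($rawEvent)];
}

// Stand-in decoder that discards _id entirely, as a user codec might.
$decode = function (array $event): object {
    unset($event['_id']);

    return (object) $event;
};

$result = processEvent(
    ['_id' => ['_data' => 'abc'], 'operationType' => 'insert'],
    $decode,
);
```

Had the decoder run first, the token would already be gone by the time it was needed for resuming.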

@alcaeus alcaeus merged commit 3cd7091 into mongodb:master Aug 25, 2023
43 checks passed
@alcaeus alcaeus deleted the phplib-1182-codec-operations branch August 25, 2023 06:39
3 participants