Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHPLIB-1181: Introduce lazy BSON classes #1135

Closed
wants to merge 27 commits into from

Conversation

alcaeus
Copy link
Member

@alcaeus alcaeus commented Jul 19, 2023

PHPLIB-1181

This PR introduces two new classes, LazyBSONDocument and LazyBSONArray. They behave like their non-lazy counterparts, except that they are backed by BSON structures. When reading data, only requested data is read from the BSON and returned. This allows for more efficient handling of large data structures. When changing data in an instance of these classes, the changes are tracked and handled separately when the structure is once again encoded to BSON. Encoding and decoding data is handled by two separate codecs that create the lazy structures and inject an appropriate codec library into them. This is necessary as the rest of the data is only decoded when accessed, so the lazy structure needs to be aware of the codec.

This PR also includes an ArrayCodec and ObjectCodec, which are used to traverse arrays and stdClass instances to encode/decode their data. Decoding will most likely not happen as we expect to be decoding from BSON, but encoding is much more likely when a user creates a lazy BSON structure manually:

$data = new LazyBSONDocument(['tags' => ['foo', 'bar']]);

When encoding this data, we need to traverse through the individual tags, necessitating the addition of codecs that handle arrays. Note that for simplicity, the codecs only access values. The structure is returned as it was (i.e. hashes are still returned as arrays), and keys are not changed at all. This makes them more suitable for other uses as well.

@alcaeus alcaeus requested review from jmikola and GromNaN July 19, 2023 12:33
@alcaeus alcaeus self-assigned this Jul 19, 2023
@alcaeus alcaeus force-pushed the phplib-1181-lazy-bson-classes branch from e7c3a8b to 013195f Compare July 19, 2023 12:36
src/Codec/ArrayCodec.php Show resolved Hide resolved
src/Model/LazyBSONArray.php Outdated Show resolved Hide resolved
Copy link
Member

@GromNaN GromNaN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not yet clear to me how it's going to be used. Can you add an example or a functional test?

src/Model/LazyBSONArray.php Outdated Show resolved Hide resolved
src/Model/LazyBSONArray.php Outdated Show resolved Hide resolved
src/Model/LazyBSONArray.php Outdated Show resolved Hide resolved
src/Model/LazyBSONArray.php Outdated Show resolved Hide resolved
src/Model/AsListIterator.php Outdated Show resolved Hide resolved
tests/Codec/LazyBSONCodecLibraryTest.php Outdated Show resolved Hide resolved
src/Model/LazyBSONDocument.php Outdated Show resolved Hide resolved
src/Model/LazyBSONDocument.php Outdated Show resolved Hide resolved
src/Model/LazyBSONDocument.php Show resolved Hide resolved
src/Model/LazyBSONDocument.php Outdated Show resolved Hide resolved
@alcaeus alcaeus force-pushed the phplib-1181-lazy-bson-classes branch from 013195f to b12055c Compare July 20, 2023 07:12
@alcaeus
Copy link
Member Author

alcaeus commented Jul 20, 2023

It's not yet clear to me how it's going to be used. Can you add an example or a functional test?

I will once the next PR is done, which will add codec support to various operation classes. The intended usage will be:

$collection = $client->selectCollection('db', 'Coll', ['codec' => new LazyBSONDocumentCodec()]);
$results = $collection->find();

When iterating through the results, instead of reading BSON and decoding it to PHP objects according to a type map, the data will be returned as a LazyBSONDocument which only reads data from BSON once it's accessed. More usage examples will follow once that PR is done, at which point I'll also add a tutorial on how to map data to custom classes using this method.

@alcaeus alcaeus force-pushed the phplib-1181-lazy-bson-classes branch 5 times, most recently from 0e8159a to a9e219b Compare July 21, 2023 11:22
@alcaeus alcaeus force-pushed the phplib-1181-lazy-bson-classes branch 2 times, most recently from 8960c4f to d741a6e Compare July 24, 2023 07:11
@alcaeus alcaeus force-pushed the phplib-1181-lazy-bson-classes branch from d741a6e to 6708343 Compare July 24, 2023 07:26
Copy link
Member

@GromNaN GromNaN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

src/Codec/ArrayCodec.php Outdated Show resolved Hide resolved
src/Codec/ArrayCodec.php Outdated Show resolved Hide resolved
src/Codec/ArrayCodec.php Outdated Show resolved Hide resolved
src/Codec/LazyBSONArrayCodec.php Show resolved Hide resolved
src/Codec/LazyBSONArrayCodec.php Show resolved Hide resolved
tests/Model/LazyBSONArrayTest.php Show resolved Hide resolved
tests/Model/LazyBSONArrayTest.php Show resolved Hide resolved
tests/Model/LazyBSONArrayTest.php Outdated Show resolved Hide resolved
tests/Model/LazyBSONArrayTest.php Outdated Show resolved Hide resolved
tests/Model/LazyBSONDocumentTest.php Outdated Show resolved Hide resolved
@alcaeus alcaeus requested a review from jmikola July 31, 2023 09:21
$this->bson = PackedArray::fromPHP([]);
} elseif ($input instanceof PackedArray) {
$this->bson = $input;
} elseif (is_array($input)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the class to accept lists with gaps

The logic below is technically more permissive than just a "list with gaps", as you're only checking that keys are numeric. There's no enforcement that each key is greater than the previous. That seems trivial (both in terms of code required and overhead) to implement if you think it's worthwhile.

I realize this isn't required, since the max() logic for appending an element still considers all keys. The existing behavior is consistent with what PHP does for arrays: https://3v4l.org/ZI423

tests/Model/LazyBSONDocumentTest.php Outdated Show resolved Hide resolved
unset($document->foo);
$this->assertFalse(isset($document->foo));

// Set new value to ensure unset also clears values not read from BSON
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test overwrites $document->document. I realize that differs from the previous test a few lines up that unsets the $document->foo property, but the comment "clears values not read from BSON" above seems like it would equally apply to the first test in this method, where you assign and unset $document->new.

Would a better description here be that you're testing overwriting an existing field, and having unset() apply to both the new and original values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Modified to read:

Set new value to ensure unsetting a previously modified value does not fall back to loading values from BSON

tests/Model/LazyBSONDocumentTest.php Outdated Show resolved Hide resolved
tests/Model/LazyBSONDocumentTest.php Outdated Show resolved Hide resolved
src/Model/LazyBSONArray.php Outdated Show resolved Hide resolved
src/Model/LazyBSONArray.php Show resolved Hide resolved
src/Model/LazyBSONArray.php Show resolved Hide resolved
src/Codec/ObjectCodec.php Show resolved Hide resolved
src/Codec/ArrayCodec.php Show resolved Hide resolved
$this->bson = PackedArray::fromPHP([]);
} elseif ($input instanceof PackedArray) {
$this->bson = $input;
} elseif (is_array($input)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reckon the correct behavior would be to implement these like SplDoubleLinkedList, where you could repeatedly call offsetUnset() on index zero to empty out the list (each removal shifts subsequent indexes over). That said, I'm sure this would be unexpected for most users, a wild BC break for BSONArray, and would cause more harm than good we made BSONArray and LazyBSONArray inconsistent in this regard. Also, PackedArray is read-only, so the question never comes up there. Moving on...

I understand that we're stuck with weird edge cases because the array is a hash map, but I'm not convinced that ksort() is the right approach. Passing in arrays with backwards key sequences is admittedly an edge case, but I think users would expect LazyBSONArray to behave like BSONArray and defer any re-ordering until serialization time.

Consider:

$ba = new MongoDB\Model\BSONArray([]);

$ba[1] = 'foo';
$ba[0] = 'bar';

var_dump($ba);
var_dump($ba->jsonSerialize());

This outputs:

object(MongoDB\Model\BSONArray)#2 (1) {
  ["storage":"ArrayObject":private]=>
  array(2) {
    [1]=>
    string(3) "foo"
    [0]=>
    string(3) "bar"
  }
}
array(2) {
  [0]=>
  string(3) "foo"
  [1]=>
  string(3) "bar"
}

Iteration (and a straight var_dump()) returns the values in the order they were set with no consideration of keys. Actual serialization re-indexes via array_values().

Perhaps the real change I'm after would be requiring a list array for both BSONArray and LazyBSONArray. Does that seem untenable? Likely a BC break for BSONArray, so I'd be willing to defer it to 2.0 -- but having users explicitly call array_values() when needed seems preferable to make them aware that they're leaving themselves open to some strange/unexpected behavior.

* Model class for a BSON array.
*
* The internal data will be filtered through array_values() during BSON
* serialization to ensure that it becomes a BSON array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like copypasta from BSONArray since array_values() isn't used within. I think this should be written to mention that the class is intended to use used with the corresponding codec.

* Model class for a BSON document.
*
* The internal data will be cast to an object during BSON serialization to
* ensure that it becomes a BSON document.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like LazyBSONArray, this comment appears to have been copied from the BSONDocument class. I think it needs to be rewritten. If these classes aren't going to directly implement our BSON Serializable interface, then I think this should just mention that these are only intended to be used with the corresponding Codec classes.

public function getIterator(): ListIterator
{
// Sort keys to ensure they are in ascending order
ksort($this->set, SORT_NUMERIC);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above. I think sorting here might create differing behavior between BSONArray and LazyBSONArray. Deferring any reordering until serialization time seems more consistent.

If these Lazy classes aren't going to implement BSON\Serializable and must rely on the codec for encoding, then I don't think we need to sort here. LazyBSONArrayCodec::encode() already disregards keys by using $return[] to build up the list used for constructing a PackedArray. This sorting appears to be here for the benefit of jsonSerialize(), which currently only uses iterator_to_array().

If the user constructs a LazyBSONArray with an out-of-sequence PHP array (e.g. [1 => 'foo', 0 => 'bar'], then I'd expect iteration to return the values "foo" and "bar" in sequence based on my previous experience with BSONArray, ArrayObject, and basic arrays. Moreover, BSONArray's BSON and JSON encoding would yield the sequence ["foo", "bar"]. That's the existing behavior users get with BSONArray.

I think the common rule with both classes should be that "keys do not dictate list order" if we aim to be consistent.

And maybe that's the root question here: are we prioritizing consistency with BSONArray and PHP arrays, or are we trying to do something new with LazyBSONArray?

The current implementation of this class feels more like a mutable version of PackedArray, and I think that's the reason we're struggling so much. SplDoublyLinkedList can function as a mutable PackedArray, but trying to provide the sloppy ArrayObject or array API is inviting all sorts of edge cases.

$array[1] = 'bar';
$array[0] = 'baz';

// Since BSONArray uses array_values internally, the reverse order is expected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my earlier comments, I'm not sure I'd refer to this as "reverse order". Rather, I'd say that "keys are disregarded" as the array's natural order is used.


public function jsonSerialize(): array
{
return iterator_to_array($this->getIterator());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove key sorting from getIterator(), I think it'd make sense to specify $preserve_keys = false here (see: iterator_to_array()) to ensure we reindex.

@alcaeus alcaeus marked this pull request as draft August 4, 2023 14:49
@alcaeus alcaeus closed this Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants