-
Notifications
You must be signed in to change notification settings - Fork 263
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PHPLIB-1181: Introduce lazy BSON classes #1135
Conversation
e7c3a8b
to
013195f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not yet clear to me how it's going to be used. Can you add an example or a functional test?
013195f
to
b12055c
Compare
I will once the next PR is done, which will add codec support to various operation classes. The intended usage will be:
When iterating through the results, instead of reading BSON and decoding it to PHP objects according to a type map, the data will be returned as a LazyBSONDocument which only reads data from BSON once it's accessed. More usage examples will follow once that PR is done, at which point I'll also add a tutorial on how to map data to custom classes using this method. |
0e8159a
to
a9e219b
Compare
…Document from an object
8960c4f
to
d741a6e
Compare
d741a6e
to
6708343
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
$this->bson = PackedArray::fromPHP([]); | ||
} elseif ($input instanceof PackedArray) { | ||
$this->bson = $input; | ||
} elseif (is_array($input)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the class to accept lists with gaps
The logic below is technically more permissive than just a "list with gaps", as you're only checking that keys are numeric. There's no enforcement that each key is greater than the previous. That seems trivial (both in terms of code required and overhead) to implement if you think it's worthwhile.
I realize this isn't required, since the max()
logic for appending an element still considers all keys. The existing behavior is consistent with what PHP does for arrays: https://3v4l.org/ZI423
tests/Model/LazyBSONDocumentTest.php
Outdated
unset($document->foo); | ||
$this->assertFalse(isset($document->foo)); | ||
|
||
// Set new value to ensure unset also clears values not read from BSON |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test overwrites $document->document
. I realize that differs from the previous test a few lines up that unsets the $document->foo
property, but the comment "clears values not read from BSON" above seems like it would equally apply to the first test in this method, where you assign and unset $document->new
.
Would a better description here be that you're testing overwriting an existing field, and having unset()
apply to both the new and original values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Modified to read:
Set new value to ensure unsetting a previously modified value does not fall back to loading values from BSON
$this->bson = PackedArray::fromPHP([]); | ||
} elseif ($input instanceof PackedArray) { | ||
$this->bson = $input; | ||
} elseif (is_array($input)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reckon the correct behavior would be to implement these like SplDoubleLinkedList, where you could repeatedly call offsetUnset() on index zero to empty out the list (each removal shifts subsequent indexes over). That said, I'm sure this would be unexpected for most users, a wild BC break for BSONArray, and would cause more harm than good we made BSONArray and LazyBSONArray inconsistent in this regard. Also, PackedArray is read-only, so the question never comes up there. Moving on...
I understand that we're stuck with weird edge cases because the array is a hash map, but I'm not convinced that ksort()
is the right approach. Passing in arrays with backwards key sequences is admittedly an edge case, but I think users would expect LazyBSONArray to behave like BSONArray and defer any re-ordering until serialization time.
Consider:
$ba = new MongoDB\Model\BSONArray([]);
$ba[1] = 'foo';
$ba[0] = 'bar';
var_dump($ba);
var_dump($ba->jsonSerialize());
This outputs:
object(MongoDB\Model\BSONArray)#2 (1) {
["storage":"ArrayObject":private]=>
array(2) {
[1]=>
string(3) "foo"
[0]=>
string(3) "bar"
}
}
array(2) {
[0]=>
string(3) "foo"
[1]=>
string(3) "bar"
}
Iteration (and a straight var_dump()
) returns the values in the order they were set with no consideration of keys. Actual serialization re-indexes via array_values()
.
Perhaps the real change I'm after would be requiring a list array for both BSONArray and LazyBSONArray. Does that seem untenable? Likely a BC break for BSONArray, so I'd be willing to defer it to 2.0 -- but having users explicitly call array_values()
when needed seems preferable to make them aware that they're leaving themselves open to some strange/unexpected behavior.
* Model class for a BSON array. | ||
* | ||
* The internal data will be filtered through array_values() during BSON | ||
* serialization to ensure that it becomes a BSON array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like copypasta from BSONArray since array_values()
isn't used within. I think this should be written to mention that the class is intended to use used with the corresponding codec.
* Model class for a BSON document. | ||
* | ||
* The internal data will be cast to an object during BSON serialization to | ||
* ensure that it becomes a BSON document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like LazyBSONArray, this comment appears to have been copied from the BSONDocument class. I think it needs to be rewritten. If these classes aren't going to directly implement our BSON Serializable interface, then I think this should just mention that these are only intended to be used with the corresponding Codec classes.
public function getIterator(): ListIterator | ||
{ | ||
// Sort keys to ensure they are in ascending order | ||
ksort($this->set, SORT_NUMERIC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above. I think sorting here might create differing behavior between BSONArray and LazyBSONArray. Deferring any reordering until serialization time seems more consistent.
If these Lazy classes aren't going to implement BSON\Serializable and must rely on the codec for encoding, then I don't think we need to sort here. LazyBSONArrayCodec::encode()
already disregards keys by using $return[]
to build up the list used for constructing a PackedArray. This sorting appears to be here for the benefit of jsonSerialize()
, which currently only uses iterator_to_array()
.
If the user constructs a LazyBSONArray with an out-of-sequence PHP array (e.g. [1 => 'foo', 0 => 'bar']
, then I'd expect iteration to return the values "foo" and "bar" in sequence based on my previous experience with BSONArray, ArrayObject, and basic arrays. Moreover, BSONArray's BSON and JSON encoding would yield the sequence ["foo", "bar"]
. That's the existing behavior users get with BSONArray.
I think the common rule with both classes should be that "keys do not dictate list order" if we aim to be consistent.
And maybe that's the root question here: are we prioritizing consistency with BSONArray and PHP arrays, or are we trying to do something new with LazyBSONArray?
The current implementation of this class feels more like a mutable version of PackedArray, and I think that's the reason we're struggling so much. SplDoublyLinkedList can function as a mutable PackedArray, but trying to provide the sloppy ArrayObject or array API is inviting all sorts of edge cases.
$array[1] = 'bar'; | ||
$array[0] = 'baz'; | ||
|
||
// Since BSONArray uses array_values internally, the reverse order is expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my earlier comments, I'm not sure I'd refer to this as "reverse order". Rather, I'd say that "keys are disregarded" as the array's natural order is used.
|
||
public function jsonSerialize(): array | ||
{ | ||
return iterator_to_array($this->getIterator()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove key sorting from getIterator()
, I think it'd make sense to specify $preserve_keys = false
here (see: iterator_to_array()
) to ensure we reindex.
PHPLIB-1181
This PR introduces two new classes,
LazyBSONDocument
andLazyBSONArray
. They behave like their non-lazy counterparts, except that they are backed by BSON structures. When reading data, only requested data is read from the BSON and returned. This allows for more efficient handling of large data structures. When changing data in an instance of these classes, the changes are tracked and handled separately when the structure is once again encoded to BSON. Encoding and decoding data is handled by two separate codecs that create the lazy structures and inject an appropriate codec library into them. This is necessary as the rest of the data is only decoded when accessed, so the lazy structure needs to be aware of the codec.This PR also includes an
ArrayCodec
andObjectCodec
, which are used to traverse arrays andstdClass
instances to encode/decode their data. Decoding will most likely not happen as we expect to be decoding from BSON, but encoding is much more likely when a user creates a lazy BSON structure manually:When encoding this data, we need to traverse through the individual tags, necessitating the addition of codecs that handle arrays. Note that for simplicity, the codecs only access values. The structure is returned as it was (i.e. hashes are still returned as arrays), and keys are not changed at all. This makes them more suitable for other uses as well.