Natively mapping BSON data types to wrapper classes or callbacks #900

Closed
ralfstrobel opened this issue Jul 27, 2018 · 16 comments
@ralfstrobel

Issues #246 and #370 discussed difficulties decoding mongo dates to DateTime with specific time zones. Developer consensus favors userland conversion/wrapping as the most flexible solution. What I would like to propose in this issue is a performance optimization and generalization for this process.

Problem Description

Our application uses a self-developed database abstraction layer, which translates UTCDateTime objects to native PHP DateTime objects and vice versa, so that mongo specifics are not exposed to higher business logic. However, since this layer is unaware of document structure, this requires a naive iteration over all loaded and stored documents, effectively the following...

array_walk_recursive($document, function (&$value) {
    if ($value instanceof UTCDateTime) {
        $value = $value->toDateTime()->setTimezone($this->timezone);
    }
});

array_walk_recursive($document, function (&$value) {
    if ($value instanceof DateTime) {
        $value = new UTCDateTime($value);
    }
});

For large documents with sparse use of date values, this recursive search and replace leads to a very significant performance loss, with retrieval of the translated data taking up to 3x as long as native retrieval from mongo. Removing the content of the callback function reveals that this is not even caused by the actual conversion of UTCDateTime to DateTime, but mostly by the iteration and closure callback overhead.
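For illustration, the overhead can be isolated with a rough timing sketch like the following (assuming a pre-loaded $documents array; the harness itself is only for measurement):

$start = microtime(true);
foreach ($documents as $document) {
    array_walk_recursive($document, function (&$value) {
        // intentionally empty: measures only the iteration and closure-call overhead
    });
}
printf("walk overhead: %.3f ms\n", (microtime(true) - $start) * 1000);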

Proposed solution

While deserializing from BSON, the driver naturally encounters all $date values and could directly convert them to a desired format (such as DateTime or a PHPC-640 wrapper), avoiding the need for subsequent traversal.

This approach would require extending the TypeMap syntax in a way that allows definitions for primitive BSON data types. Multiple variations are imaginable:

a) Mapping data types to a conversion callback:

[
    'document' => 'array',
    'fieldPaths' => [...],
    'date' => function (UTCDateTime $utcDate) use ($timezone) {
        return $utcDate->toDateTime()->setTimezone($timezone);
    }
]

b) Mapping data types to a custom wrapper class:

[
    'document' => 'array',
    'fieldPaths' => [...],
    'date' => 'MyDateTimeClass'
]

class MyDateTimeClass implements DateTimeInterface, \MongoDB\BSON\Serializable
{
    private $date;
    
    public function __construct(UTCDateTime $utcDate)
    {
        $this->date = $utcDate->toDateTime();
    }
    ...
}

While $date is likely the most common use case, this solution could certainly also be applied for other types such as $binary or $oid.

@derickr
Contributor

derickr commented Jul 31, 2018

Hi!

We have discussed this several times, and our conclusion was to not implement something in the TypeMap for this. You have already found #246 and #370, but there is a ticket for this too, where we have an example of how we think you should handle this case instead. This ticket (and your issue) has led us to write a new tutorial, which just went through code review and will be published on https://docs.mongodb.com/php-library/master/tutorial/ .

cheers,
Derick

@derickr derickr closed this as completed Jul 31, 2018
@ralfstrobel
Author

Your tutorial solution expects that the program logic which interacts with the driver is aware of the stored data structure and hence knows exactly where date values are stored. Correct me if I'm wrong, but I do not see how this is useful in situations like ours, where the database abstraction layer is not schema-aware.

@jmikola jmikola reopened this Aug 1, 2018
@derickr
Contributor

derickr commented Aug 2, 2018

We have an outstanding ticket for delegating type conversions to a (single) callable. We're not quite sure whether we will be implementing this, as we have not done any research into its performance impact. Just as with your observation that deserialisation takes 3 times as long, we expect there to be quite a performance hit in the driver as well. Second, in order to prevent data loss we need an accompanying way of serialising the converted data back into something that the database understands; for example, it needs to turn the non-native DateTime object back into a UTCDateTime when storing. Hence, we are reluctant to commit to this for now.

Our current thinking is that type conversions should be done by model classes that an application has for representing the data. This is why we have the MongoDB\BSON\Serializable and MongoDB\BSON\Unserializable interfaces. You can only specify array, stdClass, or a class implementing the MongoDB\BSON\Unserializable interface for safety reasons, as allowing deserialisation to something else makes it possible to lose data when writing the data back to the database (the class might only have private properties, for example).

If you set the top-level "root" type in the typeMap to, for example, Model\User, and implement these two interfaces, the User's bsonSerialize and bsonUnserialize methods can handle the conversions. If there are nested models (say, addresses) embedded, then the fieldPaths part of the typeMap can do the conversions. For example:

<?php

namespace Model;

class User implements \MongoDB\BSON\Serializable, \MongoDB\BSON\Unserializable
{
    private $createdDate;

    public function bsonUnserialize(array $data)
    {
        // $data['created_at'] arrives as a MongoDB\BSON\UTCDateTime instance
        $this->createdDate = \DateTimeImmutable::createFromMutable($data['created_at']->toDateTime());
        …
    }

    public function bsonSerialize()
    {
        return [
            'created_at' => new \MongoDB\BSON\UTCDateTime($this->createdDate),
        ];
    }
}

with a typeMap like:

[
    'root' => 'Model\User',
]

This does mean that your DBAL needs to know what data from which collections it is loading, but it does solve the round trip issue that your callback (a) can't solve.

Your option b currently can't work either. Your example says to implement MongoDB\BSON\Serializable, but the bsonSerialize method currently cannot return anything besides an array or stdClass object. We have this written down in another ticket. Just like PHPC-999, we will need to do benchmarks and find out whether we want to take the performance penalty.

@jmikola
Member

jmikola commented Aug 2, 2018

as allowing deserialisation to something else makes it possible to lose data when writing the data back to the database (the class might only have private properties, for example).

Slight correction here. The argument for requiring array, stdClass, and Unserializable is that those are the only types for which the driver can be assured it can properly initialize a PHP value with BSON document/array data. It's straightforward for array/stdClass (we set keys or public properties) and for Unserializable we can rely on the bsonUnserialize() method and trust the user to initialize the object as they wish.

It is still possible to lose data with an Unserializable. For instance, bsonUnserialize() could simply ignore some data provided by the driver or it could store data in non-public properties and fail to implement Serializable and provide the data back to the driver via bsonSerialize(). The onus is on the user to take care of the PHP-to-BSON conversion later, and they can still screw up by only caring about BSON-to-PHP.

The important distinction is that we've at least given them all the tools to handle conversion to/from PHP and BSON properly and without a risk of data loss. They can easily implement Serializable in their own model classes. If we provide some mechanism to allow a "scalar" BSON type to be converted to an arbitrary PHP type (e.g. UTCDateTime to DateTime), the user effectively has no tools available to extend DateTime and ensure it serializes properly on its own.

Consider that type maps are only used in the BSON-to-PHP direction. The common code path for that conversion is Cursor::setTypeMap() (excluding the BSON functions). There is no common code path for PHP-to-BSON conversion, as it happens throughout the driver (e.g. Command and Query objects, BulkWrite methods). Any solution would entail significant API changes, which is reason enough to do so cautiously and only after extensive prototyping and benchmarking.
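For illustration, a minimal sketch of that common code path (assuming a local mongod and a placeholder test.users namespace):

$manager = new MongoDB\Driver\Manager('mongodb://localhost:27017');
$cursor = $manager->executeQuery('test.users', new MongoDB\Driver\Query([]));

// The type map only affects how BSON is converted to PHP values while iterating
// this cursor; it plays no role when documents are sent to the server.
$cursor->setTypeMap(['root' => 'array', 'document' => 'array', 'array' => 'array']);

foreach ($cursor as $document) {
    // $document is a plain PHP array here
}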

@ralfstrobel
Author

Ah, sorry, I had not found PHPC-999 and PHPC-1000 previously, which indeed suggest effectively the same as I did in my original post.

we need an accompanying way for serialising the converted data back into something that the database understands

Truth be told, this is something that was bugging me about my own proposal as well. My first thought after investigating our performance problem was actually that the driver should simply have a straightforward option to unserialize BSON dates natively to PHP DateTime instead of UTCDateTime. If accompanied by the native ability to serialize DateTime(Interface) objects, this would likely provide the best performance and maintain the serialize/unserialize symmetry to avoid data loss.

However, in order to be useful to us, this feature would have to include an option to specify the desired time zone for the DateTime instances created by the unserializer. Otherwise we would still end up having to do our recursive data walk in order to set the time zone on all date objects. Given the previous rejection of PHPC-760, I decided to propose a generic approach instead.

Whatever the technical solution, I believe it is not an outlandish request to want to work conveniently with date values in the form of the DateTime objects PHP natively provides (in the correct local time zone). Hence the number of issues/tickets related to this general subject. Date handling with PHP/Mongo has been a longstanding source of complaints and debates in my team, going back to the old mongo driver. I'm sure many more developers feel the same way and would welcome any simplification you can send our way.

Our current thinking is that type conversions should be done by model classes that an application has for representing the data. (...) This does mean that your DBAL needs to know what data from which collections it is loading.

Herein lies the problem. I do not believe that a database driver should impose a bias towards certain programming paradigms or design patterns, if neither the underlying database nor the programming language do so.

Mongo is a database for unschematized semi-structured data. Which is why it is great for our application, which deals with (partially) unschematized semi-structured data, using procedural programming and transformer / pipeline patterns. Yet you want to tell me that a schema-aware OOP representation of our database documents is the only supported way to perform primitive type conversions efficiently.

Don't get me wrong. All the OOP support options built into the serialization/unserialization logic are convenient and elegant for many applications, but they are not a one-size-fits-all solution.

@TCB13

TCB13 commented Mar 7, 2019

I'm with @ralfstrobel: it should be added to the TypeMap, and his proposed syntax would be a good addition. From a practical point of view, the solution proposed by @derickr adds a lot of overhead that we don't need in most apps.

in order to prevent dataloss we need an accompanying way for serialising the converted data back into something that the database understands

That won't be a problem if you decide to implement option b), mapping data types to a custom wrapper class.

@jmikola
Member

jmikola commented Mar 7, 2019

@TCB13: Thanks for bumping this. @ralfstrobel makes a good argument here and I think PHPC-999 will ultimately be the best solution as it would allow applications to specify a callable to which the driver can delegate type conversions. We have a number of higher priority tasks at the moment (cross-driver features to support new server functionality) but know that the issue is still on our radar to POC and investigate further once we can return to PHP-specific work.

@jmikola
Member

jmikola commented Nov 15, 2019

Closing, but feel free to follow the aforementioned tickets for updates. They've also been organized under an epic (PHPC-1454) to track the theme of ODM improvements.

@jmikola jmikola closed this as completed Nov 15, 2019
@ralfstrobel
Author

There has finally been some development on this issue, though sadly not in the direction I had in mind. PHPC-999 has now been closed as won't fix. The new intended direction seems to be PHPLIB-1172, based on PHPC-2242.

The latter certainly opens up an interesting new approach. I will try out the new BSON/Document class, though I fear manually iterating over every value and deserializing it will also be quite slow and no advantage over conversion to array and then walking the array structure.

@alcaeus
Member

alcaeus commented Aug 3, 2023

Hi Ralf,

there has indeed been some movement on this. In version 1.16 of the extension, we introduced two new classes to handle raw BSON: MongoDB\BSON\Document and MongoDB\BSON\PackedArray. Let me explain the direction we're taking this.

The first purpose of MongoDB\BSON\Document is providing an object oriented API for functions in the MongoDB\BSON namespace (fromPHP, fromBSON, fromJSON, toPHP, toCanonicalExtendedJSON, toRelaxedExtendedJSON). This class makes it easier to handle raw BSON by allowing you to get keys and iterate over all keys, which wasn't possible without first converting the raw BSON string to a PHP object.
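As a small illustration (a sketch based on the 1.16 API; the sample data is made up):

use MongoDB\BSON\Document;
use MongoDB\BSON\UTCDateTime;

$document = Document::fromPHP(['name' => 'example', 'created_at' => new UTCDateTime()]);

var_dump($document->has('created_at')); // bool(true)
var_dump($document->get('name'));       // string(7) "example"

// Keys can be iterated without first converting the raw BSON string to a PHP structure
foreach ($document as $key => $value) {
    // ...
}

echo $document->toRelaxedExtendedJSON();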

The Document class can also be used when reading values from the database by passing a ['root' => 'bson'] typeMap wherever it is accepted. When working with a Document instance, any embedded document will also be returned as Document instance, while arrays are returned as PackedArray instances.
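For example (a sketch with placeholder collection and field names):

$manager = new MongoDB\Driver\Manager();
$cursor = $manager->executeQuery('test.events', new MongoDB\Driver\Query([]));
$cursor->setTypeMap(['root' => 'bson']);

foreach ($cursor as $document) {
    // $document is a MongoDB\BSON\Document; embedded documents come back as
    // Document instances and embedded arrays as MongoDB\BSON\PackedArray instances
    $createdAt = $document->get('created_at');
}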

The previous deserialisation mechanism (including passing Unserializable instances in a typeMap and using Persistable instances) always recursively traversed any BSON structure, deserialising the entire document including embedded fields. This also worked from the bottom up, meaning that deserialisation (i.e. calling Unserializable::bsonUnserialize) always started deep in the BSON document and recursed upwards toward the root document. This can have big implications when working with deeply nested structures or large documents in general, especially in an ODM context where projections typically aren't used but not all properties may be accessed.

However, introduction of these raw BSON classes is only half the story. In the PHP library, we're adding a codec mechanism to allow users to customise how BSON documents are deserialised to PHP objects and vice versa. One big drawback of the current mechanism is that serialisation needs to be handled by the object itself (through using the Serializable, Unserializable, and Persistable interfaces). Unfortunately, this makes this mechanism unsuitable in certain environments. For example, using a value object from a third-party library (e.g. MoneyPHP) wouldn't be possible, as the value objects are final and there is no way for you to introduce serialisation logic to it. This would mean creating a separate value object for the sole purpose of serialising and deserialising such data.

This brings me to Codecs. Essentially, Codecs are the factories that we considered introducing to the typeMap, except that we're introducing them in the PHP library, as the extension is only a low-level abstraction over libmongoc taking care of low-level functionality (such as SDAM, the wire protocol, and handling BSON).

A common use case is to want to work with PHP's date classes instead of the BSON UTCDateTime class. We can create a codec for such a use case:

class DateTimeCodec implements DocumentCodec
{
    public function canEncode($value): bool
    {
        return $value instanceof DateTimeInterface;
    }

    public function encode($value): Document
    {
        // todo: throw if can't encode
        return Document::fromPHP([
            'utcDateTime' => new UTCDateTime($value),
            'tz' => $value->getTimeZone()->getName(),
        ]);
    }

    public function canDecode($value): bool
    {
        // Todo: can be optimised
        return $value instanceof Document && $value->has('utcDateTime') && $value->has('tz');
    }

    public function decode($value): DateTimeImmutable
    {
        // todo: throw if can't decode
        $dateTime = $value->get('utcDateTime')->toDateTime();
        $dateTime->setTimeZone(new DateTimeZone($value->get('tz')));

        return DateTimeImmutable::createFromMutable($dateTime);
    }
}

We can now decouple the serialisation logic from the actual object, meaning that any POPO (plain old PHP object) can now be stored in the database without being restricted to only public properties being serialised. If we want to combine them, we can still have our model class implement the DocumentCodec interface and contain the logic there.

Now that we have our codec, let's use it. PHPLIB will come with a LazyBSONDocumentCodec, which will create lazy BSON structures for use in your application. These lazy structures will behave like the BSONArray and BSONDocument classes, except that they don't eagerly load data. You can follow the current work in the mongodb/mongo-php-library#1135 (introducing the lazy classes) and mongodb/mongo-php-library#1140 (adding codec support to the Collection class). When all is said and done, you'd do the following:

$client = new Client();
$codec = new LazyBSONDocumentCodec();
$collection = $client->selectCollection('my_db', 'my_coll', ['codec' => $codec]);

$document = $collection->findOne(['_id' => 1]);

$document will be an instance of LazyBSONDocument holding a Document instance (introduced above); when you access data that hasn't been read yet, it will be read from the BSON and decoded using a codec. We can add our previously created codec to the library:

$library = new LazyBSONCodecLibrary();
$library->attachCodec(new DateTimeCodec());

$codec->attachCodecLibrary($library);

When we use this codec, any DateTimeInterface instances will be stored as an embedded document containing a UTCDateTime and a timezone identifier. When reading data from the database, this is then decoded to a DateTimeImmutable with the correct timezone set.

As for the performance of this, it goes without saying that not eagerly decoding big BSON structures provides a big performance boost and saves memory. I am currently working on performance benchmarks to keep an eye on these savings and on any additional cost from the codecs being involved. I can share some performance numbers for the BSON structures:

+------------------+------------------+-----+------+-----+----------+-----------+--------+
| benchmark        | subject          | set | revs | its | mem_peak | mode      | rstdev |
+------------------+------------------+-----+------+-----+----------+-----------+--------+
| DocumentBench    | benchToPHPObject |     | 5    | 3   | 25.365mb | 24.287ms  | ±1.09% |
| DocumentBench    | benchToPHPArray  |     | 5    | 3   | 25.365mb | 24.158ms  | ±0.17% |
| DocumentBench    | benchCheckFirst  |     | 5    | 3   | 6.498mb  | 1.000μs   | ±0.00% |
| DocumentBench    | benchCheckLast   |     | 5    | 3   | 6.498mb  | 2.935ms   | ±0.25% |
| DocumentBench    | benchAccessFirst |     | 5    | 3   | 6.498mb  | 1.012μs   | ±8.84% |
| DocumentBench    | benchAccessLast  |     | 5    | 3   | 6.498mb  | 2.938ms   | ±0.21% |
| DocumentBench    | benchIteration   |     | 5    | 3   | 6.498mb  | 113.871ms | ±0.36% |
+------------------+------------------+-----+------+-----+----------+-----------+--------+

The test uses a 2.7 MB file that we use for performance benchmarking our BSON implementation. It contains more than 94000 keys. You can see that converting the structure to a PHP object or array (using type maps) takes around 24 ms. Directly accessing the first element takes 1 microsecond, while accessing the last element (which involves libbson continuously advancing the internal BSON pointer) takes roughly 3 ms. However, we can also see that iteration takes relatively long due to how PHP iterators work (each iteration step in a foreach loop involves 4 method calls, so this scales horribly). Your use case will definitely influence how you work with such structures: whether it's more efficient to convert the whole thing to a PHP structure (because you're accessing everything), or whether you work with the raw BSON (because you're only working with single fields). Memory usage also plays a certain part, as not converting everything to PHP values at once clearly uses less memory.

In the DateTimeCodec example above, there is still potential for performance optimisation. For example, checking each BSON document when decoding to see whether it contains utcDateTime and tz keys clearly isn't very optimised and can be improved. When using the default lazy codecs, there isn't a better way to handle this without making the system overly complicated. However, the idea is that client libraries with more knowledge of the structure will handle this. For example, an upcoming effort will be to leverage codecs in the Doctrine MongoDB ODM to replace the current hydrator and type systems, leveraging lazy objects ("proxies") in the process.

Please note that all the codec APIs are currently a work in progress and may change until we release them in our 1.17 library release. The MongoDB\BSON\Document and MongoDB\BSON\PackedArray classes are considered stable and you can already use them. I am also working on examples and tutorials to explain the usage of this and to show people how this new system can make it easier to work with their data.

Last but not least, I'm curious to hear about your use case. Clearly you have had an interest in this functionality for a long time, so I'm interested to know whether you think this system will work for you, and if it wouldn't, what we can improve to make it more useful.

Please note that all code examples have been written in this comment box without IDE support. All syntax errors potentially found in these examples have most certainly been introduced only for your reading pleasure.

@ralfstrobel
Author

ralfstrobel commented Aug 3, 2023

Hi Andreas,

thank you for the detailed explanation. I do really like how you have finally made BSON fully natively represented in PHP and I'm sure your new approach is very useful and convenient for many applications. However, I do not see it being useful for us...

Quoting myself from 2018...

Mongo is a database for unschematized semi-structured data. Which is why it is great for our application, which deals with (partially) unschematized semi-structured data, using procedural programming and transformer / pipeline patterns.

We do not know the semantics of what we are storing beyond the data type of atomic values. We do not use model objects that need to be serialized and deserialized or could provide a lazy decoding mechanism. We do in fact not even use your driver library. Instead we purely interact with the driver extension to pump associative PHP arrays in and out of the database as fast as we possibly can. Our only problem being that these arrays can also contain temporal values, represented as DateTime objects, which is where this discussion originated.

Due to our need for maximum performance, a PHP userland implementation of a codec that has to be asked if it can decode every single value is not going to work for us. And, as I said before, likely neither is PHP userland traversal of your BSON representation objects each time we need to load a document. But I will give that a try to see how practical it is for us.
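For context, the extension-only pattern described here boils down to something like the following (a sketch with placeholder namespace and field names):

$manager = new MongoDB\Driver\Manager('mongodb://localhost:27017');

// Writing: plain PHP arrays go in via a BulkWrite
$bulk = new MongoDB\Driver\BulkWrite();
$bulk->insert(['title' => 'example', 'created_at' => new MongoDB\BSON\UTCDateTime()]);
$manager->executeBulkWrite('app.documents', $bulk);

// Reading: plain PHP arrays come back out via the type map
$cursor = $manager->executeQuery('app.documents', new MongoDB\Driver\Query([]));
$cursor->setTypeMap(['root' => 'array', 'document' => 'array', 'array' => 'array']);
$documents = $cursor->toArray();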

@alcaeus
Member

alcaeus commented Aug 3, 2023

Hi Ralf,

thank you for explaining your use case. There are multiple points in your response I want to elaborate on.

We do not know the semantics of what we are storing beyond the data type of atomic values. [...] Our only problem being that these arrays can also contain temporal values, represented as DateTime objects, which is where this discussion originated.

I apologise if this sounds somewhat condescending, but that is a relatively straightforward use case that can already be accomplished with little to no performance impact. I'll also note that the introduction of the Document class will not make that process any more painful or slower for you (I'm still working on reducing the time required for iteration). However, the goal for the driver goes much further than allowing a different class to be returned for each BSON type, as explained above.

Instead we purely interact with the driver extension to pump associative PHP arrays in and out of the database as fast as we possibly can. [...] We do in fact not even use your driver library.

I understand the need for performance when inserting or reading data. I just cooked up a small performance benchmark using the same large document as above. I'll note that we have some time set aside later this quarter to implement more performance benchmarks, but here's the result of a quick test. The insertOne test inserts a single document; the insertMany test inserts 10 documents. I'll just leave the results here:

+-------------+---------------------------+-----+------+-----+----------+-----------+--------+
| benchmark   | subject                   | set | revs | its | mem_peak | mode      | rstdev |
+-------------+---------------------------+-----+------+-----+----------+-----------+--------+
| InsertBench | benchInsertOneCollection  |     | 5    | 5   | 21.302mb | 52.017ms  | ±0.61% |
| InsertBench | benchInsertOneManager     |     | 5    | 5   | 20.875mb | 54.867ms  | ±1.57% |
| InsertBench | benchInsertManyCollection |     | 5    | 5   | 21.302mb | 518.906ms | ±0.59% |
| InsertBench | benchInsertManyManager    |     | 5    | 5   | 20.875mb | 522.514ms | ±0.27% |
+-------------+---------------------------+-----+------+-----+----------+-----------+--------+

Without going into too much detail, the time margin of error in these benchmarks accounts for any time differences, so there isn't too much to be gained by using the extension only. I also observed similar results when running the same bench for reading data:

+-------------+------------------------------+-----+------+-----+-----------+-----------+---------+
| benchmark   | subject                      | set | revs | its | mem_peak  | mode      | rstdev  |
+-------------+------------------------------+-----+------+-----+-----------+-----------+---------+
| SelectBench | benchFindAllCollection       |     | 5    | 5   | 1.959mb   | 9.421ms   | ±19.44% |
| SelectBench | benchFindAllManager          |     | 5    | 5   | 1.867mb   | 10.765ms  | ±11.89% |
| SelectBench | benchCursorToArrayCollection |     | 5    | 5   | 172.669mb | 255.085ms | ±0.25%  |
| SelectBench | benchCursorToArrayManager    |     | 5    | 5   | 167.232mb | 249.619ms | ±0.37%  |
+-------------+------------------------------+-----+------+-----+-----------+-----------+---------+

Note that these tests are done on the code that already knows about codecs but without a codec being set on the collection. You're essentially giving up a lot of usability for little to no performance gain.

Due to our need for maximum performance, a PHP userland implementation of a codec that has to be asked if it can decode every single value is not going to work for us.

I'll note that any solution that goes beyond "always decode this one particular BSON type this particular way" will involve the extension asking some userland implementation "can you decode this value". If you provide a closure to the extension to invoke, this isn't going to be magically faster because it is invoked from the extension rather than from a PHP script.

@ralfstrobel
Author

ralfstrobel commented Aug 3, 2023

the goal for the driver goes much further than allowing returning a different specific class for each BSON type as explained above. (...)
I'll note that any solution that goes beyond "always decode this one particular BSON type this particular way" will involve the extension asking some userland implementation "can you decode this value".

Yes, I understand that. And again, I am not saying that the work you have put into the new codec concept will not be extremely useful for many other projects. However, our use-case really is as simple as "always decode this one particular BSON type this particular way". And we wish to do this with as little overhead as possible.

If you provide a closure to the extension to invoke, this isn't going to be magically faster because it is invoked from the extension rather than from a PHP script.

Sure, but my point was never about the performance of closure invocations, but about unnecessary iteration and type checks in PHP userland...

Note that any BSON type except Date is currently handled exactly the way we want it by the TypeMap "array" mode. Solely for Date values would we wish to invoke a closure. Date values make up approximately 3% of the values we store.

Currently we retrieve documents as "array". So the extension already has to iterate over the entire BSON document structure, look at every value and its type, and then convert it either to a scalar zval or one of the BSON class instances. Afterwards, we have to iterate over the entire returned array again in PHP userland and look at every value again to check if it is a UTCDateTime instance we wish to convert (pointlessly 97% of the time). My original proposal was all about eliminating this second iteration by having the extension defer to a closure only when it actually encountered a BSON Date value.

Using the new "bson" documents, it is true that everything could be done in only one iteration, but this would take place in PHP userland, which I imagine is slower than in C. We would also have to handle recursion for Document and PackedArray as well as conversion of Int64 by hand. Like I said, I will test this, but I am not very hopeful it is vastly superior to the array method.

@ralfstrobel
Author

I finally found the time to do some representative synthetic benchmarking...

Read 100000 documents as array in 0.3938 sec
Read 100000 documents as bson in 0.1242 sec

Basic retrieval of documents is certainly a lot faster in bson mode, as long as you do not need to work with the data.

However, eventually our business logic needs them as arrays with UTCDateTime values converted to DateTime...

Read 100000 documents as array in 0.8143 sec (with post-traversal to convert dates)
Read 100000 documents as bson in 1.1283 sec (full traversal to convert to array)

Our current logic, which retrieves documents as arrays and then needs to traverse each array again to find and convert dates, almost cuts the retrieval performance in half. The new alternative of retrieving documents as BSON wrapper objects and then recursively traversing that structure to build a converted array is even slower.


To be fair to the new BSON approach, it was not designed to improve retrieval performance of full documents, but to improve retrieval of a few sparse fields from each object...

Read 100000 sparse documents as array in 1.0013 sec (with post-traversal to convert dates)
Read 100000 sparse documents as bson in 0.8292 sec

Here the new approach could theoretically have a slight edge over our current approach, which needs to traverse the full document array unnecessarily to convert dates which may not even be retrieved later. But this really only applies assuming the database was not already given a projection to reduce the document fields to those which will be retrieved later.


@alcaeus I'm afraid there is not much news here. The new BSON option is certainly nice, just not for us. If the driver had an option to return BSON date values as PHP DateTimes natively or yield to a callback, this could still almost double our retrieval performance.

@alcaeus
Member

alcaeus commented Mar 5, 2024

Hi @ralfstrobel, would you be able to share the code you used for the benchmarks above? I'd like to take a look to see where we can find some room for improvements. We're currently not planning any significant changes in the BSON API, but we are thinking about how users interact with the MongoDB data structures in the future. I'll caution that any kind of callback API (e.g. being able to pass a closure in charge of decoding a specific BSON type) might not be very performance-efficient either.

The main performance issue in your case is always going to be deep-traversing the entire structure looking for UTCDateTime values and converting them to DateTimeInterface values. I'm not sure how much performance would be gained by doing that in a callback, but I'm happy to take a look. If you can also share a sample document (with any personal data anonymised) for me to test on, that would make this a little easier as well. When redacting any sensitive information, please try to keep the size of the BSON structure the same (e.g. by changing sensitive strings to a random string of the same length).

@ralfstrobel
Author

Hi @alcaeus, thanks for getting back to me so quickly!

The specific test code I used is somewhat entangled with our query and testing infrastructure, but I've distilled the important parts into a standalone script that essentially does the same things and shows similar results.
