Enable forced sync to disk (fsync) for NativeStore #3187
Replies: 13 comments
-
Forced sync currently gives about half the throughput of lazy sync for small transactions.
-
That's quite significant, and not something I'm immediately keen to make "the new default", to be honest. If I remember correctly, what force sync does is an I/O-level forcing of bytes to disk, whereas normally java.nio has some internal leeway for buffering things. It's not that we are caching loads of stuff in memory ourselves; this is an optimization at the JVM level.
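For reference, here is a minimal java.nio illustration of the distinction (illustration only, not RDF4J code; the class and file name are made up for the example):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ForceSyncDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("demo.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // write() only hands the bytes to the OS; they may sit in a buffer
            // and would be lost on a power failure.
            ch.write(ByteBuffer.wrap("some record".getBytes()));
            // force(true) asks the OS to flush both file data and metadata to
            // the storage device -- this is the expensive part of "force sync".
            ch.force(true);
        }
    }
}
```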
-
I'll want to try to find out what is slow first. But one thing we really need to ask ourselves is how the NativeStore is marketed. If it's compared to things like an SQL database, then people will expect the database to survive a catastrophic failure like a power outage. Then comes the question of whether this forced sync actually achieves that, and whether there isn't some way to do it more efficiently.
-
"Force sync", not "forced".
-
Updated benchmarks:
-
I think that there should be a way to speed up fsync for larger transactions. At the moment we are calling fsync after each write, but it is sufficient to call fsync at the end of the transaction. This would still impact small transactions, but less so as the transactions get larger. We could also definitely use buffered writes to speed up larger transactions: it makes little sense to issue a new write operation for every statement if you know that there are 100 statements. Is there a write-ahead log for the NativeStore?
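A sketch of that batching idea, with made-up names (TxnFile and writeRecord are not RDF4J APIs), assuming individual records fit in the buffer:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch: batch writes in memory and pay the fsync cost once per
// transaction instead of once per statement.
class TxnFile {
    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);

    TxnFile(FileChannel channel) {
        this.channel = channel;
    }

    void writeRecord(byte[] record) throws IOException {
        if (buffer.remaining() < record.length) {
            flushBuffer(); // assumes a single record is smaller than the buffer
        }
        buffer.put(record);
    }

    void commit() throws IOException {
        flushBuffer();
        channel.force(false); // single fsync at the end of the transaction
    }

    private void flushBuffer() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }
}
```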
-
I'm not sure I follow. The 'force sync' is part of the general data sync, which is normally only invoked when a Sink is flushed or the transaction is committed or rolled back. Are you saying it happens on individual writes somewhere?
I'm not sure how buffered writes fit into this story, to be honest. Wouldn't buffered writes make things worse in terms of resilience against power outages and other catastrophes? More generally: I'm sure you've noticed already, but a lot of the native store's internals were designed when Java 1.4, and therefore java.nio, was still fairly new. It relies a lot on fairly low-level NIO operations and their guarantees around consistency and non-blocking I/O. I'd like to see if we could improve that (I'm sure we can, a lot has changed since Java 1.4), but I'm not sure more buffering on top of that is the answer.
There is no separate write-ahead log. Statements are immediately added to the triple indexes, but with status flags that indicate how they are part of the active transaction (newly added, previously removed but now added again, removed, modified from implicit to explicit, etc). Atomicity is guaranteed by how those flags are processed when a transaction has to be committed, rolled back, or replayed. The system of bit flags also means that we can do a lot of in-place updates in the store. I wouldn't be against a more explicit WAL approach: if nothing else it would make the system more auditable, and there's a good chance that if we set it up right, we can improve throughput.

I think the ValueStore is probably the weakest link in the whole native store setup. It's a single lookup dictionary mapping resource values to internal identifiers, which are then used to store the data in the triple indexes. Not only does that mean we have to assign and store values in this dictionary for every added triple (which, even with all the caching put in place around it, is probably not the most performant part), it also means we have a single point of failure in case of file corruption: the triple indexes are resilient in that they can clean themselves up if triples are left "dangling" (the flags mentioned earlier give enough info for that), and even if one index somehow becomes corrupted it can usually be restored by looking at the other indexes. But the value store ID file, if it becomes corrupted (for example due to a failure in the middle of a write), is very difficult to recover (there's no automated recovery that I know of), and if it's corrupted the entire database is essentially useless. Even worse, probably, is that it grows dirty over time: if I remember correctly we never got around to writing a "clean up" operation for it, to remove mappings for resources no longer present in the store.
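To make the flag idea concrete, here is an illustrative sketch only (not the actual flag layout or names the NativeStore uses):

```java
// Illustration, not NativeStore's real layout: each stored statement carries a
// small bitmask describing its state in the active transaction, so commit and
// rollback become a sweep over flags rather than a log replay.
final class StatementFlags {
    static final byte EXPLICIT = 1;      // asserted directly rather than inferred
    static final byte ADDED    = 1 << 1; // added in the current transaction
    static final byte REMOVED  = 1 << 2; // removed in the current transaction

    /** Returns the new flags on commit; 0 means the statement can be reclaimed. */
    static byte commit(byte flags) {
        if ((flags & REMOVED) != 0) {
            return 0;
        }
        return (byte) (flags & ~ADDED);
    }

    /** Returns the new flags on rollback; additions are discarded, removals undone. */
    static byte rollback(byte flags) {
        if ((flags & ADDED) != 0) {
            return 0;
        }
        return (byte) (flags & ~REMOVED);
    }
}
```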
-
I'm wondering if an approach with a reliable MD hash instead of this integer ID based on file offsets might be more performant in this day and age. It would certainly save us half the lookups. Of course hashes will be bigger than integers, so you'd lose something on the performance of the indexes. And of course there's a theoretical risk of collision.
-
I use SHA-256 in the Elasticsearch store. I have collision detection, but I haven't gotten around to doing collision handling.

A few years ago I started thinking that we could do really fast query processing if we used IDs (preferably hashes) for things like joins, and only fetched the actual RDF value when strictly needed (e.g. projections or comparison-based filters). It's in no way realistic for RDF4J, it just seemed very intriguing.

As for the fsync operations: I would expect that as transactions become larger there would be less effect from using fsync, since fsync should only be called when all the writes are complete. Since large transactions with fsync are still 50% slower, I think we are calling fsync on every write.
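As a sketch of what a hash-based value dictionary with collision detection could look like (made-up class and method names, kept in memory purely for illustration; uses java.util.HexFormat from Java 17+):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;

// Sketch only: derive a value's ID from its SHA-256 digest (truncated to 128
// bits) instead of a file offset, and detect collisions by remembering the
// lexical form behind each ID.
class HashValueDictionary {
    private final Map<String, String> idToValue = new HashMap<>();

    String idFor(String lexicalForm) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(lexicalForm.getBytes(StandardCharsets.UTF_8));
        String id = HexFormat.of().formatHex(Arrays.copyOf(digest, 16));
        String previous = idToValue.putIfAbsent(id, lexicalForm);
        if (previous != null && !previous.equals(lexicalForm)) {
            // Collision detection; actual collision handling is left out here.
            throw new IllegalStateException("Hash collision for id " + id);
        }
        return id;
    }
}
```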
-
Btw, I've come to realise that fsync is not about caching in the Java layer, and also that FileChannel has no cache in the Java layer. It's all about the OS-level cache. As the Linux man page for close(2) points out, a successful close does not guarantee that the data has actually been written to disk.
We should consider calling fsync when closing the native store files, to ensure that everything is at least flushed when the user calls .shutdown().
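Something along these lines, as a minimal sketch (closeSynced is a made-up helper, not an existing method):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;

// Sketch of the suggestion: flush OS-buffered data to disk before closing,
// since close() by itself gives no durability guarantee.
final class SyncedClose {
    static void closeSynced(FileChannel channel) throws IOException {
        try {
            channel.force(true); // flush data and metadata
        } finally {
            channel.close();
        }
    }
}
```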
-
...we don't currently do that? Wow. Yes, that sounds like an excellent idea.
-
With SHA-256, the probability of a collision occurring after generating 4.8E+35 hashes (which, I think you'll agree, is significantly more than we're ever likely to store in any database) is about 1E-6. Let's just say that collision detection and handling are not a high priority: you're more likely to be killed by an asteroid impact than to ever observe a SHA-256 collision. See the table here for some decent approximations: https://en.wikipedia.org/wiki/Birthday_attack. We're probably safe with a 128-bit hash; I mean, UUIDs are 128 bits and they're generally agreed to be unique for all practical purposes.
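For reference, the birthday-bound approximation behind that table, with b hash bits and n stored values:

```latex
p(n) \approx 1 - e^{-n^2 / 2^{b+1}} \approx \frac{n^2}{2^{b+1}}
```

With b = 256 and n = 4.8E+35 that gives roughly (4.8E+35)^2 / 2^257 ≈ 2.3E+71 / 2.3E+77 ≈ 1E-6, matching the table. By the same estimate, a 128-bit hash stays below a 1E-6 collision probability until somewhere around 2.6E+16 stored values.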
-
The NativeStore has support for forcing writes to be synced to disk (fsync), to reduce the chance that a power outage will cause data loss.
For performance reasons this has been disabled by default.
We would now like to revisit this.