Enable forced sync to disk (fsync) for NativeStore #3187
Replies: 13 comments
-
Forced sync currently gives about half the throughput of lazy sync for small transactions.
-
That's quite significant, and not something I'm immediately keen to make "the new default", to be honest. If I remember correctly, what force sync does is an I/O-level forcing of bytes to disk, whereas normally java.nio has some internal leeway for buffering things. It's not that we are caching loads of stuff in memory ourselves; this is an optimization at the JVM level.
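For reference, here is a minimal java.nio illustration of the distinction (illustration only, not RDF4J code; the class and file name are made up for the example):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ForceSyncDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("demo.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // write() only hands the bytes to the OS; they may sit in a buffer
            // and would be lost on a power failure.
            ch.write(ByteBuffer.wrap("some record".getBytes()));
            // force(true) asks the OS to flush both file data and metadata to
            // the storage device -- this is the expensive part of "force sync".
            ch.force(true);
        }
    }
}
```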
-
I'll want to try to find out what is slow first. But one thing we really need to ask ourselves is how the NativeStore is marketed. If it's compared to things like an SQL database, then people will expect the database to survive a catastrophic failure like a power outage. Then comes the question of whether this forced sync actually achieves that, and whether there isn't some way to do it more efficiently.
-
"Force sync", not "forced".
-
Updated benchmarks:
-
I think that there should be a way to speed up fsync for larger transactions. At the moment we are calling fsync after each write, but it is sufficient to call fsync at the end of the transaction. This would still impact small transactions, but less so as the transactions get larger. We could also definitely use buffered writes to speed up larger transactions: it makes little sense to issue a new write operation for every statement if you know that there are 100 statements. Is there a write-ahead log for the NativeStore?
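A sketch of that batching idea, with made-up names (TxnFile and writeRecord are not RDF4J APIs), assuming individual records fit in the buffer:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch: batch writes in memory and pay the fsync cost once per
// transaction instead of once per statement.
class TxnFile {
    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);

    TxnFile(FileChannel channel) {
        this.channel = channel;
    }

    void writeRecord(byte[] record) throws IOException {
        if (buffer.remaining() < record.length) {
            flushBuffer(); // assumes a single record is smaller than the buffer
        }
        buffer.put(record);
    }

    void commit() throws IOException {
        flushBuffer();
        channel.force(false); // single fsync at the end of the transaction
    }

    private void flushBuffer() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }
}
```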
-
I'm not sure I follow. The 'force sync' is part of the general data sync, which is normally only invoked when a Sink is flushed or the transaction is committed or rolled back. Are you saying it happens on individual writes somewhere?
I'm not sure how buffered writes fit into this story, to be honest. Wouldn't buffered writes make things worse in terms of resilience against power outages and other catastrophes? More generally: I'm sure you've noticed already, but a lot of the native store's internals were designed when Java 1.4, and therefore java.nio, was still fairly new. It relies a lot on fairly low-level NIO operations and their guarantees around consistency and non-blocking I/O. I'd like to see if we could improve that (I'm sure we can, a lot has changed since Java 1.4), but I'm not sure more buffering on top of that is the answer.
There is no separate write-ahead log. Statements are immediately added to the triple indexes, but with status flags that indicate how they are part of the active transaction (newly added, previously removed but now added again, removed, modified from implicit to explicit, etc). Atomicity is guaranteed by how those flags are processed when a transaction has to be committed, rolled back, or replayed. The system of bit flags also means that we can do a lot of in-place updates in the store. I wouldn't be against a more explicit WAL approach: if nothing else it would make the system more auditable, and there's a good chance that if we set it up right, we can improve throughput.

I think the ValueStore is probably the weakest link in the whole native store setup. It's a single lookup dictionary mapping resource values to internal identifiers, which are then used to store the data in the triple indexes. Not only does that mean we have to assign and store values in this dictionary for every added triple (which, even with all the caching put in place around it, is probably not the most performant part), it also means we have a single point of failure in case of file corruption: the triple indexes are resilient in that they can clean themselves up if triples are left "dangling" (the flags mentioned earlier give enough info for that), and even if one index somehow becomes corrupted it can usually be restored by looking at the other indexes. But the value store ID file, if it becomes corrupted (for example due to a failure in the middle of a write), is very difficult to recover (there's no automated recovery that I know of), and if it's corrupted the entire database is essentially useless. Even worse, probably, is that it grows dirty over time: if I remember correctly we never got around to writing a "clean up" operation for it, to remove mappings for resources no longer present in the store.
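To make the flag idea concrete, here is an illustrative sketch only (not the actual flag layout or names the NativeStore uses):

```java
// Illustration, not NativeStore's real layout: each stored statement carries a
// small bitmask describing its state in the active transaction, so commit and
// rollback become a sweep over flags rather than a log replay.
final class StatementFlags {
    static final byte EXPLICIT = 1;      // asserted directly rather than inferred
    static final byte ADDED    = 1 << 1; // added in the current transaction
    static final byte REMOVED  = 1 << 2; // removed in the current transaction

    /** Returns the new flags on commit; 0 means the statement can be reclaimed. */
    static byte commit(byte flags) {
        if ((flags & REMOVED) != 0) {
            return 0;
        }
        return (byte) (flags & ~ADDED);
    }

    /** Returns the new flags on rollback; additions are discarded, removals undone. */
    static byte rollback(byte flags) {
        if ((flags & ADDED) != 0) {
            return 0;
        }
        return (byte) (flags & ~REMOVED);
    }
}
```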
-
I'm wondering if an approach with a reliable MD hash instead of this integer ID based on file offsets might be more performant in this day and age. It would certainly save us half the lookups. Of course hashes will be bigger than integers, so you'd lose something on the performance of the indexes. And of course there's a theoretical risk of collision.
-
I use SHA-256 in the Elasticsearch store. I have collision detection, but I haven't gotten around to doing collision handling.

A few years ago I started thinking that we could do really fast query processing if we used IDs (preferably hashes) for things like joins, and only fetched the actual RDF value when strictly needed (e.g. projections or comparison-based filters). It's in no way realistic for RDF4J, it just seemed very intriguing.

As for the fsync operations: I would expect that as transactions become larger there would be less effect from using fsync, since fsync should only be called when all the writes are complete. Since large transactions with fsync are still 50% slower, I think we are calling fsync on every write.
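As a sketch of what a hash-based value dictionary with collision detection could look like (made-up class and method names, kept in memory purely for illustration; uses java.util.HexFormat from Java 17+):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;

// Sketch only: derive a value's ID from its SHA-256 digest (truncated to 128
// bits) instead of a file offset, and detect collisions by remembering the
// lexical form behind each ID.
class HashValueDictionary {
    private final Map<String, String> idToValue = new HashMap<>();

    String idFor(String lexicalForm) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(lexicalForm.getBytes(StandardCharsets.UTF_8));
        String id = HexFormat.of().formatHex(Arrays.copyOf(digest, 16));
        String previous = idToValue.putIfAbsent(id, lexicalForm);
        if (previous != null && !previous.equals(lexicalForm)) {
            // Collision detection; actual collision handling is left out here.
            throw new IllegalStateException("Hash collision for id " + id);
        }
        return id;
    }
}
```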
-
Btw, I've come to realise that fsync is not about caching in the Java layer, and also that FileChannel has no cache in the Java layer. It's all about the OS-level cache. As the Linux man page for close(2) points out, a successful close does not guarantee that the data has actually been written to disk.
We should consider calling fsync when closing the native store files, to ensure that everything is at least flushed when the user calls .shutdown().
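Something along these lines, as a minimal sketch (closeSynced is a made-up helper, not an existing method):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;

// Sketch of the suggestion: flush OS-buffered data to disk before closing,
// since close() by itself gives no durability guarantee.
final class SyncedClose {
    static void closeSynced(FileChannel channel) throws IOException {
        try {
            channel.force(true); // flush data and metadata
        } finally {
            channel.close();
        }
    }
}
```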
-
...we don't currently do that? Wow. Yes, that sounds like an excellent idea.
-
With SHA-256, the probability of a collision occurring after generating 4.8E+35 hashes (which, I think you'll agree, is significantly more than we're ever likely to store in any database) is about 1E-6. Let's just say that collision detection and handling are not a high priority: you're more likely to be killed by an asteroid impact than to ever observe a SHA-256 collision. See the table here for some decent approximations: https://en.wikipedia.org/wiki/Birthday_attack. We're probably safe with a 128-bit hash; I mean, UUIDs are 128 bits and they're generally agreed to be unique for all practical purposes.
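For reference, the birthday-bound approximation behind that table, with b hash bits and n stored values:

```latex
p(n) \approx 1 - e^{-n^2 / 2^{b+1}} \approx \frac{n^2}{2^{b+1}}
```

With b = 256 and n = 4.8E+35 that gives roughly (4.8E+35)^2 / 2^257 ≈ 2.3E+71 / 2.3E+77 ≈ 1E-6, matching the table. By the same estimate, a 128-bit hash stays below a 1E-6 collision probability until somewhere around 2.6E+16 stored values.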
-
The NativeStore has support for forcing writes to be synced to disk (fsync), to reduce the chance that a power outage will cause data loss.
For performance reasons this has been disabled by default.
We would now like to revisit this.