mfs: clean cache on sync #751
base: main
Conversation
There have been multiple reports of MFS slowing down to a crawl and triggering OOM events when adding many items to a folder.

One underlying issue is that the MFS Directory has a cache which can grow unbounded. All items added to a directory are cached there even though they have already been written to the underlying UnixFS folder by the mfs writes.

The cache serves Child() requests only. Its purpose is that a Directory can give updated information about its children without being "up to date" on its own parent: when a new write happens to a folder, the parent folder needs to be notified about the new unixfs node of its child. With this cache, that notification can be delayed until a Flush() is requested, one of the children triggers a bubbling notification, or the parent folder is traversed.

When the parent folder is traversed, it writes the cached entries of its children to disk, including that of the current folder. In order to write the current folder it triggers a GetNode() on it, which in turn writes all the cached nodes in the current folder to disk. This cascades. In principle, leaves do not have children and should not need to be rewritten to unixfs, but without caching the TestConcurrentWrites tests start failing. Caching provides a way for concurrent writes to access a single reference to a unixfs node, which somehow makes things work.

Apart from cosmetic changes, this commit allows resetting the cache once it has been synced to disk. Other approaches were tested, like fully removing the cache, or resetting it when it reaches a certain size. All were unsatisfactory in that MFS breaks in one way or another (e.g. when setting maxCacheSize to 1). Setting the size to ~50 allows tests to pass, but only because no test exercises 50+1 items.

This change does not break any tests, but after what I've seen I assume that concurrent writes during a Flush event probably result in broken expectations, and I wouldn't be surprised if there is a way to chain writes, lookups and flushes on references to multiple folders in the tree and trigger errors where things just work now.

The main problem remains: the current setup essentially keeps all of mfs in memory. This commit could alleviate the fact that reading the directory object triggers as many unixfs.AddChild() calls as there are nodes in the directory. After this, AddChild() would be triggered only on nodes cached since the last sync.

I have not found a reliable way of fixing MFS, and I have discovered multiple gotchas in what was going to be a quick fix.
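For readers unfamiliar with the mfs internals, here is a minimal, self-contained sketch of the mechanism being changed. The names `entriesCache`, `syncCache`, `FSNode` and `unixfsDir.AddChild` mirror the diff below; the surrounding types (`node`, `unixfsDir`, `directory`) are simplified stand-ins for illustration, not the real boxo/mfs code.

```go
// Simplified sketch (not the actual boxo/mfs code) of the idea in this PR:
// a directory caches child entries that were already written to the
// underlying UnixFS directory, and the cache is reset once it is synced.
package main

import (
	"context"
	"fmt"
	"sync"
)

// node stands in for the real FSNode / ipld.Node types in boxo/mfs.
type node struct{ name string }

// unixfsDir stands in for the underlying UnixFS directory object.
type unixfsDir struct{ children map[string]*node }

func (u *unixfsDir) AddChild(_ context.Context, name string, nd *node) error {
	u.children[name] = nd
	return nil
}

// directory mirrors the relevant fields of mfs.Directory in simplified form.
type directory struct {
	lock         sync.Mutex
	ctx          context.Context
	unixfsDir    *unixfsDir
	entriesCache map[string]*node // grows with every write until synced
}

// syncCache writes every cached entry to the underlying UnixFS directory.
// With clean=true (the change in this PR) the cache is reset afterwards,
// so it no longer keeps every entry in memory between syncs.
func (d *directory) syncCache(clean bool) error {
	d.lock.Lock()
	defer d.lock.Unlock()
	for name, nd := range d.entriesCache {
		if err := d.unixfsDir.AddChild(d.ctx, name, nd); err != nil {
			return err
		}
	}
	if clean {
		d.entriesCache = make(map[string]*node)
	}
	return nil
}

func main() {
	d := &directory{
		ctx:          context.Background(),
		unixfsDir:    &unixfsDir{children: make(map[string]*node)},
		entriesCache: map[string]*node{"a": {name: "a"}, "b": {name: "b"}},
	}
	if err := d.syncCache(true); err != nil {
		panic(err)
	}
	fmt.Println("cached entries after sync:", len(d.entriesCache)) // prints 0
}
```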
320bf6d to 19419fb
@@ -99,10 +99,11 @@ func (d *Directory) localUpdate(c child) (*dag.ProtoNode, error) {
	d.lock.Lock()
	defer d.lock.Unlock()

-	err := d.updateChild(c)
+	err := d.unixfsDir.AddChild(d.ctx, c.Name, c.Node)
These changes just substitute the one-liner helper functions with the actual call.
// Update child entry in the underlying UnixFS directory.
func (d *Directory) updateChild(c child) error {
	err := d.unixfsDir.AddChild(d.ctx, c.Name, c.Node)
	if err != nil {
		return err
	}

	return nil
}

func (d *Directory) Type() NodeType {
	return TDir
}

// childNode returns a FSNode under this directory by the given name if it exists.
// it does *not* check the cached dirs and files
func (d *Directory) childNode(name string) (FSNode, error) {
	nd, err := d.childFromDag(name)
	if err != nil {
		return nil, err
	}

	return d.cacheNode(name, nd)
}
one-liners removed.
if clean {
	d.entriesCache = make(map[string]FSNode)
}
The only relevant change.
Codecov Report: All modified and coverable lines are covered by tests ✅
@@ Coverage Diff @@
## main #751 +/- ##
==========================================
- Coverage 60.40% 60.35% -0.05%
==========================================
Files 245 245
Lines 31063 31054 -9
==========================================
- Hits 18764 18744 -20
- Misses 10629 10637 +8
- Partials 1670 1673 +3
LGTM. One question.
}

-func (d *Directory) sync() error {
+func (d *Directory) syncCache(clean bool) error {
Is there any situation where we would not want clean = true? Seems like we should always clear the cache.
Originally I wanted to not clean on GetNode() while cleaning on reaching maxCacheSize.
I am not sure this PR is good in general... probably there should be some refcounting of which objects can be pushed out of the cache. My feeling, based on how tests misbehave when things are removed, is that at least files that are open for writing should probably stay...
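To make that refcounting idea concrete, here is a purely hypothetical sketch; nothing like this exists in boxo/mfs today, and all names (`cacheEntry`, `refCountedCache`, `Retain`, `Release`, `Clean`) are invented for illustration. Entries still referenced by open handles (e.g. files open for writing) would survive a clean; only unreferenced, already-synced entries would be dropped.

```go
// FSNode stands in for the mfs FSNode interface; the rest is invented.
type FSNode interface{}

// cacheEntry pairs a cached node with the number of open handles using it.
type cacheEntry struct {
	node FSNode
	refs int
}

// refCountedCache drops only entries that nobody references anymore.
type refCountedCache struct {
	entries map[string]*cacheEntry
}

// Retain would be called when a handle (e.g. a file open for writing) is
// created, Release when it is closed.
func (c *refCountedCache) Retain(name string) {
	if e, ok := c.entries[name]; ok {
		e.refs++
	}
}

func (c *refCountedCache) Release(name string) {
	if e, ok := c.entries[name]; ok {
		e.refs--
	}
}

// Clean removes only already-synced entries with no outstanding references,
// so open files keep their single shared reference across a sync.
func (c *refCountedCache) Clean() {
	for name, e := range c.entries {
		if e.refs <= 0 {
			delete(c.entries, name)
		}
	}
}
```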
And there's also the issue that this still allows the cache to grow unbounded until a sync, and such a sync may take minutes or hours...
@@ -22,6 +22,8 @@ The following emojis are used to highlight certain changes:

 ### Fixed

+* `mfs`: directory cache is now cleared every time the directory node is read, somewhat limiting unbounded growth and time to sync it to the underlying unixfs.
I'm scared to replace known MFS problems with new unknown MFS problems, but I'm also not familiar with the codebase well enough to have an informed opinion :(

I know folks have pipelines that do a lot of concurrent writes to MFS via the ipfs files CLI/RPC and have likely figured out what works and what does not. Are we sure the new version removes, or at least decreases, the surface for data loss during concurrent writes?

Perhaps useful context: in kubo, we have a --flush flag that (if true) calls mfs.FlushPath after each MFS operation. The intention behind disabling flush is to let the user avoid syncing during many small operations and only flush to disk once the "final" version of a file or directory is ready, but that comes at a consistency risk if there is more than one thread doing mutations:

> Most of the subcommands of ipfs files accept the --flush flag. It defaults to true. Use caution when setting this flag to false. It will improve performance for large numbers of file operations, but it does so at the cost of consistency guarantees.

So effectively, patching a lot of things should be faster with flush=false, but you need to do things sequentially. This is the mental model Kubo users have.

Are the edge cases mentioned in https://github.com/ipfs/boxo/pull/751/files avoidable with --flush=true? Or are we introducing a racy risk there, where it was not before?

If we make the "safe" default less "safe", could/should we tie cache cleanup to happen only in mfs.FlushPath instead (explicit --flush=true)? (i.e. call syncCache with clean=true only in the context of an explicit mfs.FlushPath?)
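One possible shape for that suggestion, sketched against the simplified directory type from the sketch near the top of this PR description (again hypothetical, not what this PR implements): ordinary reads would sync cached entries but keep the cache, and only an explicit flush (the mfs.FlushPath / --flush=true path) would clear it.

```go
// GetNode (simplified here to return only an error) would sync cached
// entries but keep them, preserving today's behaviour for --flush=false
// pipelines that rely on the cache between operations.
func (d *directory) GetNode() error {
	return d.syncCache(false) // write entries to unixfs, keep the cache
}

// Flush, reached via an explicit mfs.FlushPath (--flush=true), would be the
// only place that also clears the cache.
func (d *directory) Flush() error {
	return d.syncCache(true) // write entries and reset the cache
}
```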
That might be an option...
I did not know we flushed by default... That also means this commit will essentially kill the cache on every operation. I cannot say with certainty, but this probably brings the new "unknown MFS problems", as it will be equivalent to disabling the cache.
Let's do something: let me spend another while trying to figure out exactly why concurrent writes to a single file break with things like cache-size=1, and whether something else is broken that way. Maybe I can find a workaround and feel better about it.
This helps/solves a known problem, so this change does have positive value. Can we run this in cluster or somewhere else to gain more confidence that there are no negative behaviors with this change?