[report] rdfind measured performance on (large) HDD #133
## For Optimizers / Developers

So this scenario had no use for #113 at all. But what MIGHT have been done? Given that "almost 75% of the files scanned have 1 or more duplicates" and "about 60% of 'em are dupes", the duplicates could have been found in about -- <licks thumb and senses the breeze> -- half the time if the check-head and check-tail content check phases had been eliminated. After all, for this particular set, those phases only cut the candidate count from ~3M down to ~2.2M, so that's merely a ~25% gain for a noticeable additional HDD access cost. Maybe a handy command-line option to add: skip phases A+B, only do C? (A back-of-the-envelope version of this estimate is sketched further below.)

But wait! What's the expected impact of such an optimization when we do not just want a report but wish to do something about it, e.g. replace duplicates by hardlinks -- arguably one of the most costly dedup operations: it's a create-hardlink + file-delete action wrapped into one, so plenty of filesystem I/O per conversion? Optimistically, I'd say you'd have to assume a cost of at least one more "scan time" even for the cheapest action (file delete?), while the convert-to-hardlink action turned out very costly here indeed. It would, however, be safe to say that reducing the number of rounds going through the filesystem and/or accessing files CAN be beneficial; regrettably, OSes do not hand out APIs to optimize your disk access for head seek reduction, so the only way up is to get everything onto SSD if you want to speed this up (seek times there are basically nil; conversion to hardlink still requires some costly SSD page writes/clears, though).

## In the end, is it worth it to deduplicate a large file set like this (millions of files)?

Given that only a small portion of the 41+M files on that HDD got processed in ~4 days, it might be opportune to invest in yet another HDD instead and postpone/do any mandatory deduplication in the end-point application only -- I was wondering about this as I have a large test corpus that is expected to have some (south of 10%) duplicates... 🤔 It might be "smart" to only detect & report the duplicates via …

Of course, the thriftiness-lobe in my brain is on the barricades against this next revolting thought, but it may be unwise to bother deduplicating (very) large file sets unless you're at least moderately sure the deduplication will produce significant space-saving gains, say 50% or better. Otherwise you might be better off begging your employer for more funding for your permanent storage addiction.
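Returning to the phase-skipping idea above: a back-of-the-envelope sketch, under two assumptions that are mine rather than measured — that the four phases leading up to the report take roughly equal time (as argued in the observations further below), and that hashing time scales linearly with the number of candidates:

$$
\begin{aligned}
\text{time per phase} &\approx \tfrac{25\ \text{h}}{4} \approx 6.25\ \text{h} \\
\text{saved by skipping the head+tail checks} &\approx 2 \times 6.25\ \text{h} = 12.5\ \text{h} \\
\text{extra hashing of undiscarded candidates} &\approx \tfrac{3.0 - 2.2}{2.2} \times 6.25\ \text{h} \approx 2.3\ \text{h} \\
\text{estimated time to report} &\approx 25 - 12.5 + 2.3 \approx 15\ \text{h} \quad (\sim 60\%\ \text{of the measured 25 h})
\end{aligned}
$$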
Related: #114
Reporting performance so others can use this as a(nother) base for time-taken estimates when they intend to run `rdfind` on HDD-based storage.

(Thank you for creating and maintaining `rdfind`, BTW. Initial trials indicated this tool would be the fastest available when you want to find duplicates and replace them by hardlinks, i.e. without deleting any file and thus without changing the (visible) directory tree structure.)

## Scenario / setting
This was a session running on older (backup/fallback) hardware (an AMD A8 3850 CPU) with a consumer-grade mobo (not dedicated server hardware, which is generally engineered for (very) heavy disk I/O loads), and it was entirely I/O bound throughout the session.[^1]

FYI: 16 GB RAM; the target is a single BTRFS filesystem on a fresh 16 TB Toshiba HDD ("spinning rust"). The system runs the latest Ubuntu/Mint (fetched & installed May 2nd, 2023 AD). The `rdfind` version is reported as `1.5.0`, installed using `apt-get`.

The target HDD/FS holds ~41.3M files, ranging from (lots of) smallish source files to very large InDesign project sources plus video recordings (raw & compressed), taking up almost the entire 16 TB (minus a few GB: free space was reported as 4 GB at the start of the session). All files have been copied onto this filesystem using sequential (non-parallel) copy/recovery scripts + tools, with no deletes whatsoever, so no aggravating disk and/or file fragmentation is to be expected.
We only wish to dedup part of the HDD, as there are several file trees on there which are critical and not "okay" to suffer from possible SHA1 content hash collisions. (I know there's at least one spot in the selection where two PDFs MAY be present that showcase a SHA1 content collision, but it's fine if `rdfind` "deduplicates" those, as I know where to get them again. Hence `rdfind` is expected to find at least some duplicates.)

The target selection includes at least two deep git submodule trees, which are expected to have significant overlap: these were recovered from other (damaged!) drives prior to deduplication and are estimated at about 2-3M files & ~800 GB of disk space each. We also expected some overlap in some of the other selected trees, including images, raw documents, Adobe InDesign project sources, (large) PDFs and possibly some video files.

The command line (shown further below) was set up to "cherry-pick" the recovered disks' file trees I considered "non-critical"; the rest is intentionally left untouched by `rdfind`.
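For anyone wanting to reproduce this kind of cherry-picking setup, an invocation of roughly the following shape does the job. The subtree paths below are hypothetical placeholders, not the actual command line used for this session; only the directories passed on the command line get scanned, everything else stays untouched.

```sh
# Illustrative sketch only -- paths are placeholders, not this session's real command line.
# rdfind scans just the listed directories and (with -makehardlinks true)
# replaces duplicates inside that selection with hardlinks to the kept copy.
rdfind -checksum sha1 -makehardlinks true -outputname results.txt \
    /mnt/big16tb/recovered/git-submodules-a \
    /mnt/big16tb/recovered/git-submodules-b \
    /mnt/big16tb/recovered/indesign-projects \
    /mnt/big16tb/recovered/video
```

Leaving out `-makehardlinks true` (or adding `-dryrun true`) turns this into a report-only run, which is the cheaper first step.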
## Timings; Results
Expected duration: very probably more than 24 hours, hopefully complete within a week (d.v.).[^2]

Time taken: 3+ days (between 80 and 81 hours total duration; I did not run `time` on this one, so the time was old-skool eyeballed & recorded using pen, paper & wall clock).[^3]

As can be seen from the console snapshot below:
## Additional observations
The scan of the selected targets (all subtrees in the same, single BTRFS filesystem) certainly took less than half a day[^4]; `rdfind` produced a duplicates report (`results.txt`, ~365 MB of uncompressed text) after about 25 hours. The remainder of the time (2+ days) was spent on "Now making hard links.".

From some inspection sampling (plus the reported numbers of discards) I estimate that all three "Now eliminating..." phases took about the same time. Theoretically the last phase (hashing) would have taken longer, but I did not measure those, and given the incessant racket of the HDD, I assume most time was spent on moving HDD heads, i.e. seek time. Linux reported about 6 GB of RAM being used for "disk caching", but that clearly wasn't enough to stop the hammering in these phases, as the OS may already have unloaded those file tree sectors after several hours of waiting for the second access attempt to arrive.

Anyway, I think it is therefore a reasonable assumption to split the time spent evenly across the four phases that led to the creation of `results.txt`, as each accesses the same number of files (insert hand-waving here; the first scan is a metadata pass through the file tree, while subsequent phases are file content accesses of different kinds, so the "divide by 4" suggestion can easily be argued against -- in the end, however, only the total time taken matters, as you'll need that `results.txt` file or some other action to make it all worthwhile).
## Results

Performance until `rdfind` can report duplicates is about $\frac{3 \cdot 10^6}{25 \times 3600} = 33 \frac{1}{3}$ files per second (give or take a considerable margin[^5]).

Given the number of duplicate candidates and the number of candidates dropped thanks to the compare-head and compare-tail content checks (~541K + ~60K), I MAY have had results in about 50% of the time by NOT performing those scans but immediately going for the hash calculation phase. Surely that's related to this particular set having lots of small-file duplicates in deep git source-code trees. Hence a "best possible" rate estimate would be about 60-70 files processed (scanned & reported) per second.
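Spelling that "best possible" figure out, under the same assumption as above that the head and tail check phases account for roughly half of the 25 hours to the report:

$$
\frac{3 \cdot 10^6}{0.5 \times 25 \times 3600} \approx 67\ \text{files per second}
$$

which sits in the middle of the 60-70 range quoted.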
The actual deduplication phase itself (`-makehardlinks true`) took longer: $80 - 25 = 55$ hours, so the conversion to hardlinks had a performance of about $\frac{1790702}{55 \times 3600} \approx 9$ hardlink conversions per second. (Note that this is slower than merely creating hardlinks, as the duplicate files also have to be deleted from disk to complete a `-makehardlinks true` action.)
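To make the "create-hardlink + file-delete wrapped into one" cost concrete, here is a minimal shell sketch of what a single conversion amounts to at the filesystem level. The paths are hypothetical and this is only an illustration of the cost involved, not a description of rdfind's actual implementation:

```sh
# Hypothetical paths -- illustration of the per-file cost, not rdfind's code path.
original=/mnt/big16tb/keep/file.bin
duplicate=/mnt/big16tb/dupes/file.bin

# 'ln -f' unlinks the existing duplicate and writes a new directory entry that
# points at the original's inode: at least two metadata updates per converted
# file -- plus freeing the duplicate's data extents once its link count drops
# to zero -- on top of the reads already spent proving the two files identical.
ln -f "$original" "$duplicate"
```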
## Conclusion

When you apply `rdfind` (or similar tools) to an HDD (instead of an SSD), don't bother about your CPU and/or parallel processing (#113, f.e.), as you'll be I/O bound almost everywhere. 12-year-old CPU hardware didn't break a sweat, while the (very recent) HDD was hammering like an MG42 on the beaches of Normandy.

For estimating deduplication time costs on (very) large file trees (millions of files) on consumer-grade HDDs, reckon with an estimate of about 133 files per second scanned; the actual number drops as candidates (based on file size) are detected and inspected.
If all you're after is a report of the duplicates found (in `results.txt`), the estimate can drop to a near-worst-case 33 files per second if you have lots of duplicates / duplicate candidates.

If you want to deduplicate using hardlinks, processing rates MAY drop to a near-worst-case 9 files per second.
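To turn those rates into rough wall-clock figures, a worked example for a hypothetical 10M-file tree with a duplication profile similar to this one (an assumption, not a measurement):

$$
\frac{10^7}{33 \times 3600} \approx 84\ \text{h (report only)}, \qquad
\frac{10^7}{9 \times 3600} \approx 309\ \text{h} \approx 13\ \text{days (dedup via hardlinks)}
$$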
These latter "near worst case" estimates are based on a single BTRFS filesystem, the selected file trees showing a 60% deduplication by conversion to hardlinks, while 75% of the file tree was reported as being actual duplicates.
When you're going to process millions of files, you MAY find that any OS filesystem caching is, ah, inadequate, and that you're hampered more by disk seek times than by anything else.[^6]
All numbers apply to running `rdfind` on consumer-grade hardware and HDD(s); SSDs and dedicated server hardware are surely able to reach noticeably higher processing rates. 😇
on consumer-grade hardware and HDD(s); SSDs and dedicated server hardware are surely able to reach noticeably higher processing rates. 😇Footnotes
I didn't worry about Ways to speed up rdfind runs: Parallelization #113, f.e.; CPU load was reasonable (never above 30% for any core, while the machine was lightly used by other software during the session, except when the hashes were being calculated, but given the observations (by finely tuned eyeball and ear) and the enduring racket produced by the drive for the other, ah, 99%, of the time, I can't say CPU power was a significant contributor to these timing results. ↩
that was the very rough estimate after a couple of initial
rdfind
runs on a much smaller subset. It was expectedrdfind
would be I/O bound from those initial observations. ↩why the 1 hour tolerance on the timing? Two reasons: (1) to overcompensate for any human observation imprecision at start & end of session, and related (2) the end of the session was pretty obvious as the HDD stopped making a racket, but I lost precious seconds 🤡 wondering if it really was true
rdfind
had completed its run. 🤔 But let's be real: what's definitely-less-than-5-minutes on a total run of 80+ hours? Shall we call that, ah, 1 promille? 🤝 ↩not eyeballed precisely; I missed the transition to "
Now eliminating...
"; the HDD kept up the racket all the way through. I only noticed it was further down the process as I started hearing short eruptions of "silence": that's whenrdfind
had got to max out a CPU core hunting for SHA1 checksums to compare. ↩the ⅓ is just a funny coincidence of that particular sum. ↩
veering widely off topic: this is also indicative why databases love to pull the entire database into RAM to keep optimum performance throughout the day. I'm looking at you, Oracle and MS SQLServer! 😄 ↩