
[report] rdfind measured performance on (large) HDD #133

GerHobbelt opened this issue May 22, 2023 · 2 comments

Reporting measured performance so others can use it as a(nother) baseline for time-taken estimates when they intend to run rdfind on HDD-based storage.

(Thank you for creating and maintaining rdfind, BTW. Initial trials indicated this tool would be the fastest available when you want to find duplicates and replace them with hardlinks, i.e. without deleting any files and thus without changing (visible) directory tree structures.)

Scenario / setting

This session ran on older (backup/fallback) hardware: an AMD A8 3850 CPU on a consumer-grade motherboard (not dedicated server hardware, which is generally engineered for (very) heavy disk I/O loads). The session was entirely I/O bound throughout. [1]
FYI: 16GB RAM; the target is a single BTRFS filesystem on a fresh 16TB Toshiba HDD ("spinning rust"). The system runs the latest Ubuntu/Mint (fetched & installed May 2nd, 2023).
rdfind version reported as 1.5.0, installed using apt-get.

The target HDD/FS holds ~41.3M files, ranging from (lots of) smallish source files to very large InDesign project sources plus video recordings (raw & compressed), taking up almost the entire 16TB (minus a few GB: free space was reported as 4GB at the start of the session). All files were copied onto this filesystem using sequential (non-parallel) copy/recovery scripts and tools, with no deletes whatsoever, so no aggravating disk and/or file fragmentation is to be expected.

We only wish to dedup part of the HDD, as several file trees on there are critical and not "okay" to suffer from possible SHA1 content hash collisions. (I know there's at least one spot in the selection where two PDFs showcasing a SHA1 content collision MAY be present, but it's fine if rdfind "deduplicates" those as I know where to get them again. Hence rdfind is expected to find at least some duplicates.)

The target selection includes at least 2 deep git submodule trees which are expected to have significant overlap: these were recovered from other (damaged!) drives prior to deduplication and are estimated at about 2-3M files & ~800GB of disk space each. We also expected some overlap in some of the other selected trees, including images, raw documents, Adobe InDesign project sources, (large) PDFs and possibly some video files.

The command line (shown further below) was set up to "cherry-pick" the recovered disks' file trees I considered "non-critical"; the rest is intentionally left untouched by rdfind.

Timings; Results

Expected duration: very probably more than 24 hours, hopefully complete within a week (d.v.). [2]

Time taken: 3+ days (between 80-81 hours total duration; I did not run time on this one, so the duration was old-skool eyeballed & recorded using pen, paper & wall clock). [3]

As can be seen from the console snapshot below:

  • total file count scanned: ~3M
  • total file size: ~900GB
  • initial duplicates count estimate: ~2.9M (this was rather surprising; I had expected a lower number, but then, if you have 3+M files, what are the odds of two files having the same file size? (Almost) 100%.)
  • actual duplicates (by SHA1 hash): ~1.8M (also rather more than expected. Upon closer inspection, it was the huge git submodule trees that contributed almost all of the final duplicates, as those recovered copies were only a few months apart, leaving most files in them unaltered and thus pure duplicates.)
  • actual "savings" by deduplication: ~300GB (~30%)

Additional observations

The scan of the selected targets (all subtrees in the same, single BTRFS filesystem) certainly took less than half a day [4]; rdfind produced a duplicates report (results.txt, ~365MB of uncompressed text) after about 25 hours.
The remainder of the time (2+ days) was spent on "Now making hard links.".

From some inspection sampling (plus the reported numbers of discards) I estimate all three "Now eliminating..." phases took about the same time. Theoretically the last phase (hashing) should have taken longer, but I did not measure these, and given the incessant racket of the HDD I assume most time was spent moving HDD heads, i.e. on seek time. Linux reported about 6GB of RAM being used for disk caching, but that clearly wasn't enough to stop the hammering in these phases, as the OS may already have evicted those file tree sectors after several hours of waiting for the second access attempt to arrive.

Anyway, I therefore think it is a reasonable assumption to split the time spent evenly across the four phases that led to the creation of results.txt, as each accesses the same number of files (insert hand-waving here: the first scan only walks the file tree metadata, while the subsequent phases access file content in different ways, so the "divide by 4" suggestion can easily be argued against). In the end, however, only the total time taken matters, as you'll need that results.txt file or some other action to make it all worthwhile.

Results

Performance until rdfind can report duplicates is about $\frac {3 \cdot 10^6} {25 \times 3600} = 33 \frac 1 3$ files per second (give or take a considerable margin [5]).

Given the number of duplicate candidates and the number of candidates dropped thanks to the compare-head and compare-tail content checks (~541K + ~60K), I MAY have had results in about 50% of the time by NOT performing those scans but going straight for the hash calculation phase. Surely that is related to this particular set having lots of small-file duplicates in deep git source-code trees. Hence a "best possible" rate estimate would be about 60-70 files processed (scanned & reported) per second.

Hypothetically, IFF the scan had produced zero candidates instead, only the initial scanning phase would have taken up time, which we ballparked at $\frac 1 4$ of 25 hours, hence a (hand-wavy) $\frac {4 \cdot 3 \cdot 10^6} {25 \times 3600} \approx 133$ files per second.

The actual deduplication phase itself (-makehardlinks true) took longer: $80 - 25 = 55$ hours, thus conversion to hardlinks had a performance of about $\frac {1790702} {55 \times 3600} \approx 9$ hardlink conversions per second. (Note that this is slower than merely creating hardlinks, as the duplicate files also have to be deleted from disk to complete a -makehardlinks true action.)
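
To make the hand-waving reproducible, here is a minimal sketch (Python, chosen purely for illustration) that re-derives the ballpark rates above from the console numbers shown further below; the 25 h / 80 h marks and the "scan is ¼ of 25 h" split are the eyeballed figures from this report, not measured values.

# rough_rates.py -- re-derive the ballpark rates from the console numbers (eyeballed timings!)
files_total     = 3_086_233   # "Now have 3086233 files in total."
links_made      = 1_790_702   # "Making 1790702 links."
hours_to_report = 25          # eyeballed: results.txt appeared after ~25 hours
hours_total     = 80          # eyeballed: whole run took 80-81 hours

report_rate = files_total / (hours_to_report * 3600)                 # ~34 files/s
scan_only   = files_total / (hours_to_report / 4 * 3600)             # ~137 files/s (scan = 1/4 of 25 h)
link_rate   = links_made / ((hours_total - hours_to_report) * 3600)  # ~9 hardlink conversions/s

print(f"report rate    : {report_rate:6.1f} files/s")
print(f"scan-only rate : {scan_only:6.1f} files/s")
print(f"hardlink rate  : {link_rate:6.1f} conversions/s")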

Conclusion

When you apply rdfind (or similar tools) to an HDD (instead of an SSD), don't bother about your CPU and/or parallel processing (#113, f.e.) as you'll be I/O bound almost everywhere. The 12-year-old CPU hardware didn't break a sweat, while the (very recent) HDD was hammering like an MG42 at the beaches of Normandy.

For estimating deduplication time costs on (very) large file trees (millions of files) on consumer-grade HDDs, reckon with an estimate of about 133 files-per-second scanned; the actual number drops as candidates (based on file size) are detected and inspected.

If all you're after is a report of the duplicates found (in results.txt), the estimate can drop to a near-worst-case 33 files per second if you have lots of duplicates / duplicate candidates.

If you want to deduplicate using hardlinks, processing rates MAY drop to a near-worst-case 9 files per second.

These latter "near worst case" estimates are based on a single BTRFS filesystem, with the selected file trees showing roughly 60% of files deduplicated by conversion to hardlinks, while about 75% of the scanned files were reported as actual duplicates.

... hence $2243575 - 1790702 = 452873$ unique files had, on average, $\frac {2243575} { 2243575 - 1790702 } - 1 \approx 4$ duplicates!
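
As a back-of-the-envelope check against the raw console numbers: $\frac {1790702} {3086233} \approx 58\%$ of all scanned files ended up converted to hardlinks, and $\frac {2243575} {2979319} \approx 75\%$ of the size-based candidates were reported as actual duplicates; that is where the rounded 60% / 75% figures above come from.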

When you're going to process millions of files, you MAY find that any OS filesystem caching is, ah, inadequate, and you're probably hampered more by disk seek times than by anything else. [6]

All numbers apply to running rdfind on consumer-grade hardware and HDD(s); SSDs and dedicated server hardware are surely able to reach noticeably higher processing rates. 😇

root@ger-GA-A75M-UD2H:/media/ger/16TB-linux001# rdfind -ignoreempty true -minsize 100 -makehardlinks true         4TB* D-SSD-Projects/ C-SSD-Boot/ EX-Google.Drive-NOT-SYNCED/ GoogleDrive-copy2023/ OneDrive-copy2023/ Win7B*
Now scanning "4TB00M00", found 2654 files.
Now scanning "4TB0A1", found 5997 files.
Now scanning "D-SSD-Projects", found 1782039 files.
Now scanning "C-SSD-Boot", found 1277644 files.
Now scanning "EX-Google.Drive-NOT-SYNCED", found 1405 files.
Now scanning "GoogleDrive-copy2023", found 5723 files.
Now scanning "OneDrive-copy2023", found 8649 files.
Now scanning "Win7Boot", found 1893 files.
Now scanning "Win7BtPt2", found 229 files.
Now have 3086233 files in total.
Removed 0 files due to nonunique device and inode.
Total size is 971808058157 bytes or 905 GiB
Removed 106914 files due to unique sizes from list. 2979319 files left.
Now eliminating candidates based on first bytes: removed 541410 files from list. 2437909 files left.
Now eliminating candidates based on last bytes: removed 60268 files from list. 2377641 files left.
Now eliminating candidates based on sha1 checksum: removed 134066 files from list. 2243575 files left.
It seems like you have 2243575 files that are not unique
Totally, 296 GiB can be reduced.
Now making results file results.txt
Now making hard links.
Making 1790702 links.
root@ger-GA-A75M-UD2H:/media/ger/16TB-linux001# df

Footnotes

  1. I didn't worry about "Ways to speed up rdfind runs: Parallelization" (#113), f.e.; CPU load was reasonable (never above 30% for any core, except when the hashes were being calculated) while the machine was lightly used by other software during the session. Given the observations (by finely tuned eyeball and ear) and the enduring racket produced by the drive for the other, ah, 99% of the time, I can't say CPU power was a significant contributor to these timing results.

  2. That was the very rough estimate after a couple of initial rdfind runs on a much smaller subset. From those initial observations, rdfind was expected to be I/O bound.

  3. Why the 1 hour tolerance on the timing? Two reasons: (1) to overcompensate for any human observation imprecision at start & end of session, and, related, (2) the end of the session was pretty obvious as the HDD stopped making a racket, but I lost precious seconds 🤡 wondering whether rdfind really had completed its run. 🤔 But let's be real: what's definitely-less-than-5-minutes on a total run of 80+ hours? Shall we call that, ah, 1 per mille? 🤝

  4. Not eyeballed precisely; I missed the transition to "Now eliminating..." as the HDD kept up the racket all the way through. I only noticed the process was further along when I started hearing short eruptions of "silence": that's when rdfind got to max out a CPU core computing SHA1 checksums to compare.

  5. The ⅓ is just a funny coincidence of that particular sum.

  6. Veering widely off topic: this is also indicative of why databases love to pull the entire database into RAM to keep performance optimal throughout the day. I'm looking at you, Oracle and MS SQLServer! 😄

GerHobbelt (Author) commented May 22, 2023

For Optimizers / Developers

So this scenario had no use for #113 at all. But what MIGHT have been done?

Given that "almost 75% of the files scanned have 1 or more duplicates" plus "about 60% of 'em are dupes", the same result could have been found in about -- <licks thumb and senses the breeze> -- half the time if the check-head and check-tail content check phases had been eliminated. After all, for this particular set, those only cut the candidate list from ~3M down to ~2.2M, so that's about a 25% gain only, for a noticeable additional HDD access cost.

Maybe a handy commandline option to add: skip phase A+B, only do C?

But wait! What's the expected impact of such an optimization when we do not just want a report, but wish to do something about it, e.g. replace duplicates with hardlinks -- arguably one of the most costly dedup operations: it's a create-hardlink + file-delete action wrapped into one, so plenty of filesystem I/O per conversion?

Optimistically, I'd say you'd have to assume a cost of at least one more 'scan time' even for the cheapest action (file-delete?), while the convert-to-hardlink action turned out very costly here indeed.
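
For illustration only (this is not rdfind's internal code, and the temp-name suffix is my own invention): a minimal sketch of what that wrapped-up action boils down to on a POSIX filesystem -- create the hardlink under a temporary name, then atomically replace the duplicate, which also drops the old inode. Each conversion thus costs at least two directory/metadata updates (plus journal writes on BTRFS), which fits the seek-bound ~9 conversions per second observed above.

# replace_with_hardlink.py -- illustrative sketch, NOT rdfind's actual implementation
import os

def replace_with_hardlink(original: str, duplicate: str) -> None:
    """Replace `duplicate` with a hardlink to `original` (same filesystem required)."""
    tmp = duplicate + ".dedup-tmp"   # hypothetical temp name, created next to the duplicate
    os.link(original, tmp)           # metadata write 1: new directory entry, bump link count
    os.replace(tmp, duplicate)       # metadata write 2: atomic rename over the duplicate,
                                     # which unlinks (deletes) the old file's data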

It would, however, be safe to say that reducing the number of rounds through the filesystem and/or of file accesses CAN be beneficial; regrettably, OSes do not hand out APIs to optimize your disk access for head seek reduction, so the only way up is to get everything onto SSD if you want to speed this up (seek times there are basically nil; conversion to hardlink, however, still requires some costly SSD page writes/clears).

In the end, is it worth it to deduplicate a large file set like this (millions of files)?

Given that only a small portion of the 41+M files on that HDD got processed in ~4 days, it might be opportune to invest in yet another HDD instead and postpone/perform any mandatory deduplication in the end-point application only -- I was wondering about this as I have a large test corpus that is expected to contain some (south of 10%) duplicates... 🤔 It might be "smart" to only detect & report the duplicates via rdfind -- as that is by a large margin the fastest way to get intel on these -- and then postprocess that results.txt into something that can be parsed and used by the applications involved in processing such data; see the sketch below.
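
As a starting point for such postprocessing, here is a minimal sketch that groups results.txt into (original, duplicates) sets. It assumes the whitespace-separated columns rdfind documents in the file's own header comment (duptype, id, depth, size, device, inode, priority, name) and that each DUPTYPE_FIRST_OCCURRENCE line is followed by its duplicates; verify this against your own results.txt before relying on it.

# parse_results.py -- group rdfind's results.txt into (original, duplicates) sets
# Assumed column layout (see the header comment inside results.txt itself):
#   duptype id depth size device inode priority name
# The name is the last column and may contain spaces, hence split(maxsplit=7).
import sys

def duplicate_sets(path):
    sets, current = [], None
    with open(path, encoding="utf-8", errors="surrogateescape") as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue                                   # skip header comments / blank lines
            duptype, _id, _depth, size, _dev, _ino, _prio, name = line.rstrip("\n").split(maxsplit=7)
            if duptype == "DUPTYPE_FIRST_OCCURRENCE":
                current = {"original": name, "size": int(size), "duplicates": []}
                sets.append(current)
            elif current is not None:                      # DUPTYPE_WITHIN_SAME_TREE etc.
                current["duplicates"].append(name)
    return sets

if __name__ == "__main__":
    for s in duplicate_sets(sys.argv[1]):
        print(f"{len(s['duplicates'])} duplicate(s) of {s['original']} ({s['size']} bytes)")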

Of course, the thriftiness lobe in my brain is on the barricades against this next revolting thought, but it may be unwise to bother deduplicating (very) large file sets unless you're at least moderately sure the deduplication will produce significant space-saving gains, say 50% or better. Otherwise you might be better off begging your employer for more funding for your permanent storage addiction.
I had intended rdfind to check and dedup only files of 100MByte or larger by specifying -minsize 100M, but that apparently got parsed as -minsize 100 without any error or warning notice (bug? #134), a fact I sneakily hid in the console snapshot shown. 😉

GerHobbelt (Author)

Related: #114
