
jit: Replace LFU with LRU cache replacement policy #518

Merged: 4 commits merged into sysprog21:master from jit/cache on Dec 3, 2024

Conversation

@vacantron (Collaborator) commented Dec 2, 2024

The LFU (least-frequently-used) cache replacement policy is inefficient when the cache is not big enough to hold all translated blocks (IRs). For example, if two new entries are inserted into a full cache, they keep evicting each other because each of them has the lowest use count. This causes cache thrashing and lowers the cache-hit ratio.

The LRU (least-recently-used) cache replacement policy is more suitable for our needs. However, when the least-recently-used entry is evicted from the cache, the counter used to trigger JIT compilation is lost.

To address this issue, this patch introduces a degenerate adaptive replacement cache (ARC) that keeps only the LRU list and its ghost list. Evicted entries are temporarily preserved in the ghost list as history. If the key of a newly inserted entry matches one in the ghost list, the history entry is freed and the frequency mentioned above is inherited by the new entry.
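As a rough illustration of the scheme (the names, sizes, and array-based linear scans below are simplified stand-ins, not the actual src/cache.c code), an LRU list is paired with a ghost list so that a key re-inserted after eviction inherits the frequency counter it had before:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHE_CAPACITY 4
#define GHOST_CAPACITY 4

typedef struct {
    uint32_t key;       /* e.g., the PC of a translated block */
    uint32_t frequency; /* counter that triggers JIT compilation */
    int valid;
} entry_t;

static entry_t lru[CACHE_CAPACITY];   /* index 0 = most recently used */
static entry_t ghost[GHOST_CAPACITY]; /* history of evicted entries */

/* Remember an evicted entry so its frequency can be inherited later. */
static void ghost_remember(const entry_t *victim)
{
    memmove(&ghost[1], &ghost[0], (GHOST_CAPACITY - 1) * sizeof(entry_t));
    ghost[0] = *victim;
}

/* If the key was evicted recently, free its history and return the
 * frequency it had; otherwise return 0. */
static uint32_t ghost_revive(uint32_t key)
{
    for (int i = 0; i < GHOST_CAPACITY; i++) {
        if (ghost[i].valid && ghost[i].key == key) {
            uint32_t freq = ghost[i].frequency;
            ghost[i].valid = 0;
            return freq;
        }
    }
    return 0;
}

/* Look up a key; on a miss, evict the LRU entry into the ghost list and
 * insert the key with any frequency inherited from the ghost list. */
static uint32_t cache_get(uint32_t key)
{
    for (int i = 0; i < CACHE_CAPACITY; i++) {
        if (lru[i].valid && lru[i].key == key) {
            entry_t hit = lru[i];
            hit.frequency++;
            memmove(&lru[1], &lru[0], i * sizeof(entry_t)); /* move to MRU */
            lru[0] = hit;
            return lru[0].frequency;
        }
    }
    if (lru[CACHE_CAPACITY - 1].valid)
        ghost_remember(&lru[CACHE_CAPACITY - 1]);
    memmove(&lru[1], &lru[0], (CACHE_CAPACITY - 1) * sizeof(entry_t));
    lru[0] = (entry_t){.key = key, .frequency = ghost_revive(key) + 1, .valid = 1};
    return lru[0].frequency;
}

int main(void)
{
    /* Touch more keys than the cache holds, then re-touch the first one:
     * its frequency survives eviction through the ghost list (prints 2). */
    for (uint32_t pc = 0; pc < 6; pc++)
        cache_get(pc * 4);
    printf("frequency of pc 0 after revival: %u\n", cache_get(0));
    return 0;
}
```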

The performance difference between the original implementation and this patch is shown below:

* 1024 entries (10 bits)

| Metric    | Original       | Patched        |
|-----------|----------------|----------------|
| dhrystone | 17720 DMIPS    | 17660 DMIPS    |
| coremark  | 5540 iters/s   | 5550 iters/s   |
| aes       | 9.164 s        | 9.084 s        |
| nqueens   | 1.551 s        | 1.493 s        |
| hamilton  | 12.917 s       | 12.565 s       |

* 256 entries (8 bits)

| Metric    | Original       | Patched        |
|-----------|----------------|----------------|
| dhrystone | 17420 DMIPS    | 18025 DMIPS    |
| coremark  | 48 iters/s     | 5575 iters/s   |
| aes       | 8.904 s        | 8.834 s        |
| nqueens   | 6.416 s        | 1.348 s        |
| hamilton  | 3400 s         | 13.004 s       |

* 64 entries (6 bits)

| Metric    | Original       | Patched        |
|-----------|----------------|----------------|
| dhrystone | 17720 DMIPS    | 17850 DMIPS    |
| coremark  | (timeout)      | 215 iters/s    |
| aes       | 342 s          | 8.882 s        |
| nqueens   | 680 s          | 1.506 s        |
| hamilton  | (timeout)      | 13.724 s       |

* Experimental Linux kernel booting: 126 s (original) -> 21 s (patched)

@jserv (Contributor) left a comment


Benchmarks

| Benchmark suite | Current: 5ffb157 | Previous: 448f434 | Ratio |
|-----------------|------------------|-------------------|-------|
| Dhrystone | 1512 Average DMIPS over 10 runs | 1556 Average DMIPS over 10 runs | 1.03 |
| Coremark | 1399.459 Average iterations/sec over 10 runs | 1400.104 Average iterations/sec over 10 runs | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@jserv requested a review from henrybear327 on December 2, 2024 09:41
@vacantron (Collaborator, Author) commented

@jserv, can we suppress the warnings from scan-build?

https://github.com/sysprog21/rv32emu/actions/runs/12116509397/job/33777032051?pr=518

@jserv (Contributor) commented Dec 2, 2024

> can we suppress the warnings from scan-build?
> https://github.com/sysprog21/rv32emu/actions/runs/12116509397/job/33777032051?pr=518

Since both 'list' and 'hlist' are heavily tested, we can suppress the potential false alarms raised by the LLVM static analyzer.
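For reference, one common way to quiet such an analyzer false positive, sketched below with a hypothetical list helper (not necessarily the approach taken in this PR), is to state the invariant behind an `#ifdef __clang_analyzer__` guard so that only scan-build sees it:

```c
#include <assert.h>

struct node {
    struct node *next, *prev;
};

/* Hypothetical helper, not taken from rv32emu: unlink the first element
 * of a non-empty circular list. */
static struct node *pop_front(struct node *head)
{
    struct node *first = head->next;

#ifdef __clang_analyzer__
    /* Clang defines __clang_analyzer__ only while the static analyzer runs,
     * so this assertion documents the non-empty-list invariant for
     * scan-build without affecting normal builds. */
    assert(first && first != head);
#endif

    head->next = first->next;
    first->next->prev = head;
    return first;
}

int main(void)
{
    struct node head, elem;
    head.next = head.prev = &elem;
    elem.next = elem.prev = &head;
    pop_front(&head);
    return 0;
}
```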

@jserv added this to the release-2024.2 milestone on Dec 2, 2024
@visitorckw (Collaborator) commented

> The LFU (least-frequently-used) replacement policy runs into performance issues when the cache is not big enough to hold all translated blocks (IRs). The LRU (least-recently-used) replacement policy is more suitable for our needs. However, when the least-recently-used entry is dropped, the counter used to trigger JIT compilation is cleared.
>
> To address this issue, we store the "live" cache entries in both a list and a hash map, and keep the history of replaced entries in the hash map. When an out-of-use entry is added again, it can retrieve its previous counter. However, the hash map could grow very large if we never trim unused history, so we use the distance between the current counter and the last accessed time to decide whether a history entry should be dropped.

The PR description and the first patch mention performance issues but provide no benchmark numbers. They also state that LRU fits our needs better than LFU without explaining the workload or why it is more suitable. This contradicts the conclusion of commit bdc5348 ("Implement LFU as default cache along with memory pool (#125)") without providing an explanation.

Could you update the descriptions to address these issues?

@vacantron (Collaborator, Author) commented

> Could you update the descriptions to address these issues?

Sure, I will provide some benchmarks showing the performance difference in the upcoming changes.

@vacantron force-pushed the jit/cache branch 2 times, most recently from 266f269 to 05d31c1, on December 2, 2024 16:47
@jserv (Contributor) left a comment


Clarify the variant of LRU.

src/cache.c (outdated):

    if (!revived_entry) {
        new_entry->frequency = 1;
    } else {
        new_entry->frequency = revived_entry->frequency + 1;
A Collaborator commented:

nit: use revived_entry->frequency++ for brevity.

@vacantron (Collaborator, Author) replied:

These two lines are not logically equivalent. I think this may be somewhat misleading.

A Collaborator replied:

> These two lines are not logically equivalent. I think this may be somewhat misleading.

My bad. It should be ++revived_entry->frequency. Pre-increment has the same effect as revived_entry->frequency + 1.

Let me clarify: the only change is on the rhs.
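A tiny standalone example of the distinction being discussed (not code from this PR):

```c
#include <stdio.h>

int main(void)
{
    int frequency = 3;

    /* Post-increment yields the old value: a gets 3, frequency becomes 4. */
    int a = frequency++;

    frequency = 3;
    /* Pre-increment yields the new value: b gets 4, the same value as
     * frequency + 1. */
    int b = ++frequency;

    frequency = 3;
    /* Plain addition also yields 4 but, unlike both increment forms, leaves
     * frequency itself unmodified. */
    int c = frequency + 1;

    printf("%d %d %d\n", a, b, c); /* prints: 3 4 4 */
    return 0;
}
```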

@vacantron (Collaborator, Author) commented Dec 3, 2024

The revived_entry is read-only and is freed after the stored information is inherited by the new entry.

@ChinYikMing (Collaborator) left a comment

  1. Reorder the commits so that Update cache tests is either squashed into Replace LFU with LRU cache replacement policy or placed immediately after it, since the two commits are highly related. This keeps the commit log clear and easy to understand.

  2. Commit message typo in the first commit: Matric -> Metric.

@vacantron force-pushed the jit/cache branch 3 times, most recently from 7f1c234 to 448f434, on December 3, 2024 12:42
From another commit message in this PR:

> T2C uses the translated blocks (IRs) to build the LLVM IR. However, the cache might be modified by the main thread while the background thread is compiling.
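Purely as a generic illustration of the race described above (hypothetical names, not the rv32emu T2C code), one way to serialize the two threads is a single lock around cache replacement on the main thread and the read taken by the background compiler:

```c
#include <pthread.h>
#include <stdint.h>

typedef struct {
    uint32_t pc;
    void *ir; /* translated block (IR) */
} block_t;

static block_t cache[64];
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Main thread: replace an entry, e.g. on LRU eviction. */
void cache_replace(int idx, block_t blk)
{
    pthread_mutex_lock(&cache_lock);
    cache[idx] = blk;
    pthread_mutex_unlock(&cache_lock);
}

/* Background compile thread: read the entry under the same lock so it
 * sees a consistent value rather than a half-updated one. */
block_t cache_snapshot(int idx)
{
    pthread_mutex_lock(&cache_lock);
    block_t blk = cache[idx];
    pthread_mutex_unlock(&cache_lock);
    return blk;
}
```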
@jserv merged commit 451f8c0 into sysprog21:master on Dec 3, 2024
8 checks passed
@jserv (Contributor) commented Dec 3, 2024

Thanks to @vacantron for contributing!

@vacantron deleted the jit/cache branch on December 3, 2024 18:43
vestata pushed a commit to vestata/rv32emu that referenced this pull request Jan 24, 2025
jit: Replace LFU with LRU cache replacement policy