
jit: Replace LFU with LRU cache replacement policy #518

Merged: 4 commits merged into sysprog21:master from jit/cache on Dec 3, 2024

Conversation

@vacantron (Collaborator) commented Dec 2, 2024

The LFU (least-frequently-used) cache replacement policy is inefficient when the cache is not big enough to hold all translated blocks (IRs). For example, if two new entries are inserted into a full cache, they keep evicting each other because each of them has the lowest use count. This causes cache thrashing and lowers the cache-hit ratio.

The LRU (least-recently-used) cache replacement policy is more suitable for our needs. However, when the least-recently-used entry is evicted from the cache, the counter used to trigger JIT compilation is lost.

To address this issue, this patch introduces a degenerate adaptive replacement cache (ARC) that keeps only the LRU list and its ghost list. Evicted entries are temporarily preserved in the ghost list as history. If the key of a newly inserted entry matches one in the ghost list, the history entry is freed and the frequency mentioned above is inherited by the new entry.
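As a rough illustration of the scheme (the names, sizes, and array-based linear scans below are simplified stand-ins, not the actual src/cache.c code), an LRU list is paired with a ghost list so that a key re-inserted after eviction inherits the frequency counter it had before:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHE_CAPACITY 4
#define GHOST_CAPACITY 4

typedef struct {
    uint32_t key;       /* e.g., the PC of a translated block */
    uint32_t frequency; /* counter that triggers JIT compilation */
    int valid;
} entry_t;

static entry_t lru[CACHE_CAPACITY];   /* index 0 = most recently used */
static entry_t ghost[GHOST_CAPACITY]; /* history of evicted entries */

/* Remember an evicted entry so its frequency can be inherited later. */
static void ghost_remember(const entry_t *victim)
{
    memmove(&ghost[1], &ghost[0], (GHOST_CAPACITY - 1) * sizeof(entry_t));
    ghost[0] = *victim;
}

/* If the key was evicted recently, free its history and return the
 * frequency it had; otherwise return 0. */
static uint32_t ghost_revive(uint32_t key)
{
    for (int i = 0; i < GHOST_CAPACITY; i++) {
        if (ghost[i].valid && ghost[i].key == key) {
            uint32_t freq = ghost[i].frequency;
            ghost[i].valid = 0;
            return freq;
        }
    }
    return 0;
}

/* Look up a key; on a miss, evict the LRU entry into the ghost list and
 * insert the key with any frequency inherited from the ghost list. */
static uint32_t cache_get(uint32_t key)
{
    for (int i = 0; i < CACHE_CAPACITY; i++) {
        if (lru[i].valid && lru[i].key == key) {
            entry_t hit = lru[i];
            hit.frequency++;
            memmove(&lru[1], &lru[0], i * sizeof(entry_t)); /* move to MRU */
            lru[0] = hit;
            return lru[0].frequency;
        }
    }
    if (lru[CACHE_CAPACITY - 1].valid)
        ghost_remember(&lru[CACHE_CAPACITY - 1]);
    memmove(&lru[1], &lru[0], (CACHE_CAPACITY - 1) * sizeof(entry_t));
    lru[0] = (entry_t){.key = key, .frequency = ghost_revive(key) + 1, .valid = 1};
    return lru[0].frequency;
}

int main(void)
{
    /* Touch more keys than the cache holds, then re-touch the first one:
     * its frequency survives eviction through the ghost list (prints 2). */
    for (uint32_t pc = 0; pc < 6; pc++)
        cache_get(pc * 4);
    printf("frequency of pc 0 after revival: %u\n", cache_get(0));
    return 0;
}
```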

The performance difference between the original implementation and this patch is shown below:

* 1024 entries (10 bits)

| Metric    | Original       | Patched        |
|-----------|----------------|----------------|
| dhrystone | 17720 DMIPS    | 17660 DMIPS    |
| coremark  | 5540 iters/s   | 5550 iters/s   |
| aes       | 9.164 s        | 9.084 s        |
| nqueens   | 1.551 s        | 1.493 s        |
| hamilton  | 12.917 s       | 12.565 s       |

* 256 entries (8 bits)

| Metric    | Original       | Patched        |
|-----------|----------------|----------------|
| dhrystone | 17420 DMIPS    | 18025 DMIPS    |
| coremark  | 48 iters/s     | 5575 iters/s   |
| aes       | 8.904 s        | 8.834 s        |
| nqueens   | 6.416 s        | 1.348 s        |
| hamilton  | 3400 s         | 13.004 s       |

* 64 entries (6 bits)

| Metric    | Original       | Patched        |
|-----------|----------------|----------------|
| dhrystone | 17720 DMIPS    | 17850 DMIPS    |
| coremark  | (timeout)      | 215 iters/s    |
| aes       | 342 s          | 8.882 s        |
| nqueens   | 680 s          | 1.506 s        |
| hamilton  | (timeout)      | 13.724 s       |

* Experimental Linux kernel booting: 126 s (original) -> 21 s (patched)

@jserv (Contributor) left a comment


Benchmarks

| Benchmark suite | Current: 5ffb157 | Previous: 448f434 | Ratio |
|-----------------|------------------|-------------------|-------|
| Dhrystone | 1512 Average DMIPS over 10 runs | 1556 Average DMIPS over 10 runs | 1.03 |
| Coremark | 1399.459 Average iterations/sec over 10 runs | 1400.104 Average iterations/sec over 10 runs | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@jserv requested a review from henrybear327 on December 2, 2024 09:41
@vacantron (Collaborator, Author) commented

@jserv, can we suppress the warnings from scan-build?

https://github.com/sysprog21/rv32emu/actions/runs/12116509397/job/33777032051?pr=518

@jserv (Contributor) commented Dec 2, 2024

> can we suppress the warnings from scan-build?
> https://github.com/sysprog21/rv32emu/actions/runs/12116509397/job/33777032051?pr=518

Since both 'list' and 'hlist' are heavily tested, we can suppress the potential false alarms raised by the LLVM static analyzer.
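For reference, one common way to quiet such an analyzer false positive, sketched below with a hypothetical list helper (not necessarily the approach taken in this PR), is to state the invariant behind an `#ifdef __clang_analyzer__` guard so that only scan-build sees it:

```c
#include <assert.h>

struct node {
    struct node *next, *prev;
};

/* Hypothetical helper, not taken from rv32emu: unlink the first element
 * of a non-empty circular list. */
static struct node *pop_front(struct node *head)
{
    struct node *first = head->next;

#ifdef __clang_analyzer__
    /* Clang defines __clang_analyzer__ only while the static analyzer runs,
     * so this assertion documents the non-empty-list invariant for
     * scan-build without affecting normal builds. */
    assert(first && first != head);
#endif

    head->next = first->next;
    first->next->prev = head;
    return first;
}

int main(void)
{
    struct node head, elem;
    head.next = head.prev = &elem;
    elem.next = elem.prev = &head;
    pop_front(&head);
    return 0;
}
```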

@jserv added this to the release-2024.2 milestone on Dec 2, 2024
@visitorckw (Collaborator) commented

> The LFU (least-frequently-used) replacement policy runs into performance issues when the cache is not big enough to hold all translated blocks (IRs). The LRU (least-recently-used) replacement policy is more suitable for our needs. However, when the least-recently-used entry is dropped, the counter used to trigger JIT compilation is cleared.
>
> To address this issue, we store the "live" cache entries in both a list and a hash map, and keep the history of replaced entries in the hash map. When an out-of-use entry is added again, it can retrieve its previous counter. However, the hash map could grow very large if we never trim unused history, so we use the distance between the current counter and the last accessed time to decide whether a history entry should be dropped.

The PR description and the first patch mention performance issues but provide no benchmark numbers. They also state that LRU fits our needs better than LFU without explaining the workload or why it is more suitable. This contradicts the conclusion of commit bdc5348 ("Implement LFU as default cache along with memory pool (#125)") without providing an explanation.

Could you update the descriptions to address these issues?

@vacantron (Collaborator, Author) commented

> Could you update the descriptions to address these issues?

Sure, I will provide some benchmarks showing the performance difference in the upcoming changes.

@vacantron force-pushed the jit/cache branch 2 times, most recently from 266f269 to 05d31c1, on December 2, 2024 16:47
@jserv (Contributor) left a comment


Clarify the variant of LRU.

src/cache.c (outdated):

    if (!revived_entry) {
        new_entry->frequency = 1;
    } else {
        new_entry->frequency = revived_entry->frequency + 1;
A Collaborator commented:

nit: use revived_entry->frequency++ for brevity.

@vacantron (Collaborator, Author) replied:

These two lines are not logically equivalent. I think this may be somewhat misleading.

A Collaborator replied:

> These two lines are not logically equivalent. I think this may be somewhat misleading.

My bad. It should be ++revived_entry->frequency. Pre-increment has the same effect as revived_entry->frequency + 1.

Let me clarify: the only change is on the rhs.
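A tiny standalone example of the distinction being discussed (not code from this PR):

```c
#include <stdio.h>

int main(void)
{
    int frequency = 3;

    /* Post-increment yields the old value: a gets 3, frequency becomes 4. */
    int a = frequency++;

    frequency = 3;
    /* Pre-increment yields the new value: b gets 4, the same value as
     * frequency + 1. */
    int b = ++frequency;

    frequency = 3;
    /* Plain addition also yields 4 but, unlike both increment forms, leaves
     * frequency itself unmodified. */
    int c = frequency + 1;

    printf("%d %d %d\n", a, b, c); /* prints: 3 4 4 */
    return 0;
}
```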

@vacantron (Collaborator, Author) commented Dec 3, 2024

The revived_entry is read-only and is freed after the stored information is inherited by the new entry.

@ChinYikMing (Collaborator) left a comment

  1. Reorder the commits so that Update cache tests is either squashed into Replace LFU with LRU cache replacement policy or placed immediately after it, since the two commits are highly related. This keeps the commit log clear and easy to understand.

  2. Commit message typo in the first commit: Matric -> Metric.

@vacantron force-pushed the jit/cache branch 3 times, most recently from 7f1c234 to 448f434, on December 3, 2024 12:42
From another commit message in this PR:

> T2C uses the translated blocks (IRs) to build the LLVM IR. However, the cache might be modified by the main thread while the background thread is compiling.
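Purely as a generic illustration of the race described above (hypothetical names, not the rv32emu T2C code), one way to serialize the two threads is a single lock around cache replacement on the main thread and the read taken by the background compiler:

```c
#include <pthread.h>
#include <stdint.h>

typedef struct {
    uint32_t pc;
    void *ir; /* translated block (IR) */
} block_t;

static block_t cache[64];
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Main thread: replace an entry, e.g. on LRU eviction. */
void cache_replace(int idx, block_t blk)
{
    pthread_mutex_lock(&cache_lock);
    cache[idx] = blk;
    pthread_mutex_unlock(&cache_lock);
}

/* Background compile thread: read the entry under the same lock so it
 * sees a consistent value rather than a half-updated one. */
block_t cache_snapshot(int idx)
{
    pthread_mutex_lock(&cache_lock);
    block_t blk = cache[idx];
    pthread_mutex_unlock(&cache_lock);
    return blk;
}
```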
@jserv merged commit 451f8c0 into sysprog21:master on Dec 3, 2024
8 checks passed
@jserv (Contributor) commented Dec 3, 2024

Thanks to @vacantron for contributing!

@vacantron deleted the jit/cache branch on December 3, 2024 18:43
vestata pushed a commit to vestata/rv32emu that referenced this pull request Jan 24, 2025
jit: Replace LFU with LRU cache replacement policy