lots of little polishing #2614

BurntSushi · 2023-09-26T19:05:46Z

This PR brings a lot of the code into line with my current style, which has slightly shifted over the years. I've also dropped dependencies where possible. For example, globset no longer depends on the fnv crate, and instead just rolls the FNV hashing itself. Similarly, the regex crate is dropped as a dependency entirely, and we instead depend on regex-automata directly. This leads to the interesting result that ripgrep no longer depends on regex anywhere in its tree(!), but it is still most convenient to say that it uses the regex crate when talking about the kind of regex flavor it supports.

Previously, we just used a RegexSet from the regex crate. It allocated unconditionally and didn't expose any APIs to reuse that memory, so there was nothing we could do. But now that we're using the lower level regex-automata, we can reuse a PatternSet. Ideally we would just provide a way for the caller to build a PatternSet (perhaps via an opaque type) so that we don't have to shuffle data into a PatternSet and then back into the caller's `Vec<usize>`. But this at least avoids allocating for every search.

It was apparently using a format specific to a particular plugin. I did know that, but apparently the plugin is not ubiquitous or de facto standard[1]. Thus, including it I think just leads to more confusion. We definitely do not want to be in the business of bundling aliases for every conceivable plugin to different editors, so just drop it. We expose the ability to write your own format for exactly this sort of reason.

[1]: #2611 (comment)

ripgrep does not, and likely never will, report which pattern matched. Because of that, we can dedup the patterns via just their concrete syntax without any fuss. This is somewhat of a pathological case because you don't expect the end user to pass duplicate patterns in general. But if the end user generated a list of, say, names and did not dedup them, then ripgrep could end up spending a lot of extra time on those duplicates if there are many of them. By deduping them explicitly in the application, we essentially remove their extra cost completely.
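As a rough illustration of deduplicating by concrete syntax, here is a minimal sketch that drops repeated pattern strings while preserving the order of first occurrence. The function name `dedup_patterns` is hypothetical and this is not ripgrep's actual code. Note that the comparison is purely textual, so `foo` and `(?i)foo` are treated as distinct patterns.

```rust
use std::collections::HashSet;

/// Removes exact textual duplicates from `patterns`, keeping the first
/// occurrence of each and preserving the original order.
/// (Hypothetical helper for illustration only.)
pub fn dedup_patterns(patterns: Vec<String>) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut unique = Vec::with_capacity(patterns.len());
    for pat in patterns {
        // `insert` returns false when the pattern was already present.
        if seen.insert(pat.clone()) {
            unique.push(pat);
        }
    }
    unique
}
```

Since nothing downstream needs to know which pattern matched, dropping the repeats cannot change which lines are reported.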

It seems like a trifle, but if the match frequency is high enough, the allocation+formatting of line numbers (and columns and byte offsets) starts to matter. We squash that part of the profile in this commit by doing our own decimal formatting. I speculate that we get a speed-up from this by avoiding the formatting machinery and also a possible allocation. An alternative would be to use the `itoa` crate, and it is indeed marginally faster in ad hoc benchmarks, but I'm satisfied enough with this solution.
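To make the idea concrete, here is a minimal sketch of hand-rolled decimal formatting into a fixed-size stack buffer, which avoids both the `std::fmt` machinery and a heap allocation. The type name `DecimalBuf` and its methods are assumptions made for this example; the code in the printer crate may well look different.

```rust
/// Holds the decimal digits of a `u64` in a stack buffer.
/// 20 bytes suffice because `u64::MAX` has 20 decimal digits.
/// (Hypothetical type for illustration only.)
pub struct DecimalBuf {
    buf: [u8; 20],
    start: usize,
}

impl DecimalBuf {
    /// Formats `n` by writing digits from the end of the buffer backwards.
    pub fn new(mut n: u64) -> DecimalBuf {
        let mut buf = [0u8; 20];
        let mut start = buf.len();
        loop {
            start -= 1;
            buf[start] = b'0' + (n % 10) as u8;
            n /= 10;
            if n == 0 {
                break;
            }
        }
        DecimalBuf { buf, start }
    }

    /// Returns the formatted digits as ASCII bytes.
    pub fn as_bytes(&self) -> &[u8] {
        &self.buf[self.start..]
    }
}
```

A printer could then write `DecimalBuf::new(line_number).as_bytes()` straight to its output instead of going through `write!`.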

Like the previous CHANGELOG entry, this marks a bug that was fixed likely with the introduction of regex 1.9:

    $ hyperfine "rg-13.0.0 -ic '\bfoo\b \bbar\b' git-3a06386e.txt" "rg -ic '\bfoo\b \bbar\b' git-3a06386e.txt"
    Benchmark 1: rg-13.0.0 -ic '\bfoo\b \bbar\b' git-3a06386e.txt
      Time (mean ± σ):      1.034 s ±  0.011 s    [User: 1.030 s, System: 0.004 s]
      Range (min … max):    1.021 s …  1.053 s    10 runs

    Benchmark 2: rg -ic '\bfoo\b \bbar\b' git-3a06386e.txt
      Time (mean ± σ):       6.3 ms ±   0.3 ms    [User: 4.6 ms, System: 1.6 ms]
      Range (min … max):     5.6 ms …   7.3 ms    343 runs

    Summary
      'rg -ic '\bfoo\b \bbar\b' git-3a06386e.txt' ran
      164.95 ± 7.70 times faster than 'rg-13.0.0 -ic '\bfoo\b \bbar\b' git-3a06386e.txt'

This was not fixed by making \b itself faster, but rather, by improving inner literal extraction. In particular, if the regex doesn't have any literals extracted, then search time can still be quite slow:

    $ time rg-13.0.0 -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt
    57538

    real    0.427
    user    0.423
    sys     0.003
    maxmem  46 MB
    faults  0

    $ time rg -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt
    57538

    real    0.337
    user    0.333
    sys     0.003
    maxmem  46 MB
    faults  0

But then again, so is grep, because grep doesn't benefit from any literal optimizations either:

    $ time grep -E -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt
    62396

    real    1.316
    user    1.292
    sys     0.007
    maxmem  13 MB
    faults  7

The count mismatch should probably be investigated.

Fixes #1760

BurntSushi and others added 5 commits September 25, 2023 16:48

matcher: polish the grep-matcher crate

5c61f4b

Not much here. Just updating to reflect my current style and bringing the crate to the 2021 edition.

pcre2: small polishing

7a78b02

regex: some minor polish

475e582

I think I already did a clean-up of this crate when I moved it to regex 1.9, so the polish here is very minor.

crates: remove hard-coded links

7dfbcf3

And use rustdoc's native intra-crate links. So much nicer.
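For readers unfamiliar with the feature, a small sketch of what such links look like in doc comments follows. The `Match` type here is hypothetical and exists only to give the links something to point at. rustdoc resolves the bracketed paths itself, so no hard-coded docs.rs URLs are needed.

```rust
/// A half-open byte range `[start, end)`.
/// (Hypothetical type for illustration only.)
pub struct Match {
    start: usize,
    end: usize,
}

impl Match {
    /// Creates a new match. Its length is available via [`Match::len`].
    pub fn new(start: usize, end: usize) -> Match {
        assert!(start <= end, "start must not exceed end");
        Match { start, end }
    }

    /// Returns the number of bytes covered by this match, as given to
    /// [`Match::new`].
    pub fn len(&self) -> usize {
        self.end - self.start
    }
}
```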

printer: fix a few issues in the hyperlink docs

1dc1e8a

Closes #2612

ltrzesniewski reviewed Sep 26, 2023

crates/core/app.rs (outdated, resolved)

BurntSushi added 2 commits September 26, 2023 17:02

core: doc and logging touchups

c11c833

globset: polishing

d7c4277

This brings the code in line with my current style. It also inlines the dozen or so lines of code for FNV hashing instead of bringing in a micro-crate for it. Finally, it drops the dependency on regex in favor of using regex-syntax and regex-automata directly.
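To give a sense of how little code "rolling the FNV hashing itself" requires, here is a minimal sketch of a 64-bit FNV-1a hasher that plugs into the standard `Hasher` trait. It uses the standard FNV offset basis and prime. The names `Fnv1a`, `FnvBuildHasher` and `fnv1a` are chosen for this example, and this is not necessarily the exact code that landed in globset.

```rust
use std::hash::{BuildHasherDefault, Hasher};

/// A 64-bit FNV-1a hasher. (Illustrative sketch only.)
pub struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Fnv1a {
        // The standard 64-bit FNV offset basis.
        Fnv1a(0xcbf29ce484222325)
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, bytes: &[u8]) {
        for &byte in bytes {
            // FNV-1a: XOR the byte in first, then multiply by the FNV prime.
            self.0 ^= u64::from(byte);
            self.0 = self.0.wrapping_mul(0x100000001b3);
        }
    }
}

/// Lets the hasher be used with `HashMap` and `HashSet`.
pub type FnvBuildHasher = BuildHasherDefault<Fnv1a>;

/// Convenience function that hashes a byte slice in one call.
pub fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hasher = Fnv1a::default();
    hasher.write(bytes);
    hasher.finish()
}
```

A map keyed with this hasher can then be declared as `HashMap<K, V, FnvBuildHasher>` and created with `HashMap::default()`.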

BurntSushi force-pushed the ag/polish branch from b440b99 to d7c4277 Compare September 26, 2023 21:02

BurntSushi added 8 commits September 26, 2023 17:15

grep: update to the 2021 edition

b00f244

searcher: polish

9d1e011

This updates some dependencies and brings code style in line with my current practice.

ignore: polish

637f383

Like previous commits, we do a bit of polishing and bring the style up to my current practice.

deps: drop thread_local, lazy_static and once_cell

5f6066e

This is largely made possible by the addition of std::sync::OnceLock to the standard library, and the memory pool available in regex-automata.
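As a sketch of the kind of replacement this enables, the following shows a lazily initialized global built with `std::sync::OnceLock` alone, where one might previously have reached for `lazy_static!` or `once_cell`. The function `default_types` and its contents are made up for illustration and do not come from ripgrep.

```rust
use std::sync::OnceLock;

/// Returns a sorted list that is computed at most once, on first use.
/// (Hypothetical example; the data here is invented.)
pub fn default_types() -> &'static [&'static str] {
    // The static lives inside the function, so nothing else can touch it.
    static TYPES: OnceLock<Vec<&'static str>> = OnceLock::new();
    TYPES.get_or_init(|| {
        // Stand-in for some expensive one-time setup.
        let mut types = vec!["rust", "c", "python"];
        types.sort();
        types
    })
}
```

Every call after the first returns the same slice without rerunning the closure, and `get_or_init` is safe to call from multiple threads.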

deps: bump what we can

2a5770a

core: polish the core of ripgrep

13e7788

This I believe finishes our quest to do mechanical updates to ripgrep's style, bringing it in line with my current practice (loosely speaking).

BurntSushi force-pushed the ag/polish branch from 3d01666 to 70b0936 Compare September 28, 2023 21:13

BurntSushi added 12 commits September 30, 2023 08:17

deps: bump regex and regex-automata

6499659

This brings in a fix for a bug I found during ad hoc benchmarking: rust-lang/regex@aa4e4c7

printer: sprinkle in a few #[inline] annotations

f24db84

These seem to help when ripgrep emits a lot of output, especially when the --column flag is used.

matcher: add a bunch of inline annotations

c0ca832

Many of these functions should be inlineable, but I'm not 100% sure that they can be inlined without these annotations. We don't want to force things, but we do try and nudge the compiler in the right direction.
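A minimal sketch of what such a nudge looks like, using a hypothetical `Span` type rather than anything from grep-matcher: `#[inline]` is only a hint, but it makes the body of a non-generic function available for inlining into other crates. The stronger `#[inline(always)]` is deliberately not used here, in keeping with not forcing things.

```rust
/// A half-open byte range. (Hypothetical type for illustration only.)
#[derive(Clone, Copy, Debug)]
pub struct Span {
    start: usize,
    end: usize,
}

impl Span {
    /// Creates a new span.
    #[inline]
    pub fn new(start: usize, end: usize) -> Span {
        Span { start, end }
    }

    /// Returns the length of the span, or 0 if it is inverted.
    #[inline]
    pub fn len(&self) -> usize {
        self.end.saturating_sub(self.start)
    }

    /// Returns true if the span covers no bytes.
    #[inline]
    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }
}
```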

searcher: simplify 'replace_bytes' routine

6c6f0ba

I did this in the course of trying to optimize it. I don't believe I made it any faster, but the refactoring led to code that I think is more readable.

progress

77529fb

deps: update various crates

25b9159

changelog: add bug fix for \b

36ba5fb

This was probably fixed in a past commit where I bumped the regex engine to 1.9 (or perhaps more precisely, regex-automata 0.3). But I didn't track it as fixed at the time. Fixes #1275

changelog: add another note about -w/--word-regexp bugs

300671b

This was fixed a few commits ago when we updated to regex-automata 0.4 (regex 1.10). Fixes #2623

cargo: remove dependency patches

8802218

I'm too lazy to fixup old commits.

BurntSushi merged commit 7099e17 into master Oct 10, 2023
14 checks passed

BurntSushi deleted the ag/polish branch October 10, 2023 00:29
