Improve performance of finding indexables #2082

natematykiewicz · 2024-05-22T21:55:34Z

Currently, all folders and files in the current tree are turned into IndexablePath, and then excluded files are filtered out after.

When there are large file trees that are meant to be excluded, this results in a lot of unnecessary work.

ActiveStorage stores files in the tmp directory in many many small folders. So does Bootsnap. Ruby LSP has to traverse all of these files, even though the entire directory should just be ignored.

Rubocop has solved this by breaking the includes patterns up into many patterns, applying the exclusions before the Dir.glob, so I followed in their footsteps. This works great for exclusions that end in "**/*". We still need to loop through all IndexablePath objects and see if they're excluded, in the case that an extension was provided on the excluded path, but this can cut down load time dramatically.

Before this PR in my Rails app, indexables took 76 seconds to run. Now it takes, 0.19 seconds. Before and after code both return the same exact file list.

Additionally, I added node_modules to the list of excluded trees, since that can be very large and never includes Ruby files.

I also removed the *.rb from the bundler path. Having a file extension on that means we need to scan all files. But we simply want to ignore the entire bundler path tree. I kind of wonder if we should always replace *.rb with *, to help people improve performance. Excluding an actual file name or partial file name makes sense. But excluding the only file extension we scan means we can do it faster by excluding the whole folder.

Motivation

Opening a ruby file caused my LSP server to print "Ruby LSP: indexing files" for 76 seconds at 0% before the progress bar starts moving.

Implementation

I knew that Rubocop has solved this problem before, so I looked at this file and followed what they did.

Automated Tests

I added tests for the new pattern exclude_pattern that gets used with fnmatch, while ensuring I didn't break any existing tests.

Manual Tests

I made a file that has both implementations of indexables on my computer. Then ran this in the Rails Console:

Benchmark.measure { RubyIndexer::Configuration.new.indexables }
=>
#<Benchmark::Tms:0x000000012a2dead0
 @cstime=0.0,
 @cutime=0.0,
 @label="",
 @real=0.19083199999295175,
 @stime=0.0784180000000001,
 @total=0.18055299999999974,
 @utime=0.10213499999999964>


Benchmark.measure { RubyIndexer::Configuration.new.old_indexables }
=>
#<Benchmark::Tms:0x000000012a374b20
 @cstime=0.0,
 @cutime=0.0,
 @label="",
 @real=76.83584400010295,
 @stime=13.666566,
 @total=14.943883,
 @utime=1.277317>

old_indexables = RubyIndexer::Configuration.new.old_indexables
indexables = RubyIndexer::Configuration.new.indexables

indexables.size == old_indexables.size
=> true

indexables.map(&:full_path).sort == old_indexables.map(&:full_path).sort
=> true

indexables.size
=> 13933

So here you can see that it finds the same ~14k files in 0.25% of the time.

natematykiewicz · 2024-05-23T17:14:23Z

I just resolved the merge conflict

vinistock

Great improvements 🚀

lib/ruby_indexer/lib/ruby_indexer/configuration.rb

natematykiewicz · 2024-05-31T17:00:46Z

@vinistock I believe I've addressed all of your comments

natematykiewicz · 2024-06-06T17:17:32Z

@vinistock do you need anything else from me on this?

lib/ruby_indexer/lib/ruby_indexer/configuration.rb

vinistock · 2024-06-07T13:40:28Z

lib/ruby_indexer/lib/ruby_indexer/configuration.rb

+ relative_patterns = @excluded_patterns
+ .select { |p| p.end_with?("/**/*") }
+ .map { |p| p.delete_prefix("#{Dir.pwd}/") }


This code is assuming that every pattern provided by the user is inside Dir.pwd, but that may not be true and I don't think we should tie ourselves to that yet.

I think we would need to change the check here to verify that the pattern both ends with wildcards and starts with Dir.pwd or **.

Suggested change

relative_patterns = @excluded_patterns

.select { |p| p.end_with?("/**/*") }

.map { |p| p.delete_prefix("#{Dir.pwd}/") }

relative_patterns = @excluded_patterns

.select { |p| p.end_with?("/**/*") && (p.start_with?(Dir.pwd) || p.start_with?("**")) }

.each { |p| p.delete_prefix!("#{Dir.pwd}/") }

That makes sense, though I don't think we want the ! on delete_prefix. We're allocating a new array, but the strings are still the ones in @excluded_patterns, which means we'd be mutating the content of @excluded_patterns.

I added that change to the select line, but kept what I had on the map line, since I don't think we want to mutate @excluded_patterns' strings. Let me know if you think otherwise.
3a8c7fe

Just to clarify, included_patterns should be relative to Dir.pwd, right? apply_config doesn't automatically prefix it with Dir.pwd, and I don't see how users would realistically set Dir.pwd in the YAML, considering they'd all have the repo checked out to different folders. Feels like we need to be prepending Dir.pwd to every entry in included_patterns. It also feels like excluded_patterns should similarly be expected to be within Dir.pwd. I'm struggling to understand what files someone would be including / excluding that aren't in a repo or a gem. But from what I can tell, excluded_patterns refers solely to application code, not gems. You exclude entire gems, not paths within a gem.

Right now I'm feeling like include_patterns doesn't even work, unless the user provides an absolute path to the folder on disk, which isn't very portable. What's the use-case for this? Understanding will help me figure out how to best optimize this, because the goal is to avoid calling Dir.glob on folders that are excluded, which means we need to take the included patterns, apply the exclude patterns that end with wildcards against them (essentially making include_patterns a larger list of folders), to cut down on which folders Dir.glob scans.

That's good point. Now that I think of it, all patterns have to be relative to something, since you can clone a project in any directory.

I'll bring this up with the team, but I think we should indeed make everything relative to Dir.pwd.

Okay great, thanks!

If we're including File.join(Dir.pwd, "**", "*.rb") by default, all I can really think that people might be including is non-.rb ruby files, like a Gemfile or Guardfile. So that leaves me really wondering what people even use include_patterns for. When people specify include_patterns, it doesn't seem to remove the default pattern. So anything they're including is likely already in the existing pattern, unless it doesn't end with .rb or is outside of Dir.pwd.

Rubocop's Include config option is relative to Dir.pwd. And it just is meant for adding things like Gemfile or config.ru.

https://docs.rubocop.org/rubocop/configuration.html#unusual-files-that-would-not-be-included-by-default

This says that Rubocop by default:

Finds all Ruby source files under the current or other supplied directory. A Ruby source file is defined as a file with the .rb extension or a file with no extension that has a ruby shebang line as its first line.

https://github.com/rubocop/rubocop/blob/aee64e2d5d6615ef64803af52c1b82bf95b566ef/lib/rubocop/target_finder.rb#L33-L35

It accepts a directory because you can run the Rubocop CLI against a path relative to Dir.pwd (for example, just linting your models).

Hey @vinistock, where'd you guys land on this? Should I assume that all paths are relative to Dir.pwd? If so, should I update apply_config, so that include_patterns are prefixed with Dir.pwd?

lib/ruby_indexer/lib/ruby_indexer/configuration.rb

vinistock · 2024-06-07T13:54:04Z

lib/ruby_indexer/lib/ruby_indexer/configuration.rb

+ end
+
+ sig { params(base_directory: String).returns(T::Array[String]) }
+ def indexable_included_file_patterns(base_directory)


This method is quite complex. Can we add some comments explaining the steps that this takes and include an example of a pattern before and after?

natematykiewicz · 2024-06-08T05:55:39Z

lib/ruby_indexer/lib/ruby_indexer/configuration.rb

+ patterns = indexable_included_file_patterns(Dir.pwd, merge_exclude_patterns).map! { |dir| File.join(dir, "*.rb") }
+ patterns = [File.join(Dir.pwd, "**/*.rb")] if patterns.empty?


I'm realizing I've totally forgotten about @included_patterns and just treated it as hardcoded.

Indeed. We accept users defining extra glob patterns to index.

vinistock · 2024-07-10T14:13:19Z

I brought this back to the team for discussion and we reached a consensus. We do not want to copy whatever RuboCop is doing. It's hard to tell if there is legacy code or decisions baked in there that are not relevant for the Ruby LSP.

In addition to that, the majority of the gains will come from ignoring the directories at the first level, so we can instead focus on that for the performance improvement.

Can we please switch to doing this instead:

Keep the node_modules exclusion
Combine the included and excluded patterns only at the first level of directories. Without trying to go all the way descending directories. Essentially, something like this:

def combined_patterns
  # code that will return a glob pattern with only the relevant directories combined
  # on the first level below Dir.pwd

  "**/{lib,app,whatever}/**/*.rb"
end

Remove anything related to symlinks and the descending of several directory levels

Remember, we will still need to keep the excluded check loop, since someone could be excluding files that are a few directories below (which is fine).

natematykiewicz · 2024-07-10T14:47:15Z

I brought this back to the team for discussion and we reached a consensus. We do not want to copy whatever RuboCop is doing. It's hard to tell if there is legacy code or decisions baked in there that are not relevant for the Ruby LSP.

In addition to that, the majority of the gains will come from ignoring the directories at the first level, so we can instead focus on that for the performance improvement.

Can we please switch to doing this instead:

Keep the node_modules exclusion

Combine the included and excluded patterns only at the first level of directories. Without trying to go all the way descending directories. Essentially, something like this:
def combined_patterns
  # code that will return a glob pattern with only the relevant directories combined
  # on the first level below Dir.pwd

  "**/{lib,app,whatever}/**/*.rb"
end
Remove anything related to symlinks and the descending of several directory levels

Remember, we will still need to keep the excluded check loop, since someone could be excluding files that are a few directories below (which is fine).

That sounds good! I think this will end up being a lot simpler. Avoiding traversing tmp and node_modules are going to be the biggest wins anyways. I doubt many people have some massive subdirectory that they want excluded (like app/foo/**/*).

natematykiewicz · 2024-07-16T17:13:01Z

@vinistock I just found a slight problem with only excluding top-level folders. So, we want to exclude the Bundler path if it's inside of the pwd. That's something the initialize method already does. Well, a common path for that is vendor/bundle. Ideally we'd avoid traversing into vendor/bundle, since it might be rather large, and we want to ignore it. But if we're only excluding top-level directories, then we'll have to traverse all the way through the gems, allocating IndexablePath objects for every Gem file, just to remove them in the indexables.reject! loop.

Thoughts? Do we just default exclude vendor/**/* too?

natematykiewicz · 2024-07-16T22:54:11Z

2nd question @vinistock. Where'd we land on treating included_patterns and excluded_patterns as relative to Dir.pwd? It doesn't make much sense to me that someone would put an absolute path in there, because that would only work on 1 computer. This decision heavily affects how I go about solving this.

vinistock · 2024-07-17T20:23:52Z

For the first question: I'm not sure we can always just exclude the entire vendor directory. Theoretically, there could be vendored dependencies not managed by Bundler in there. I would say let's start simple, with only the first level of directories and we can then follow up with an improvement. Maybe we can treat the BUNDLE_PATH inside the workspace as a special case.

For the second one, yeah, let's make all patterns relative to the workspace.

natematykiewicz · 2024-07-17T20:24:47Z

Sounds good, thanks!

andyw8 · 2024-08-08T16:08:56Z

I'll set this to draft until @natematykiewicz has chance to return to it.

natematykiewicz · 2024-08-08T16:50:48Z

Thanks @andyw8!

One thing that's been tripping me up is the lack of normalization to the paths.

Vini said a few comments above that everything can be assumed to be relative to the workspace. The default path is a full absolute path to the directory, but I believe users would be passing in relative paths.

Perhaps you guys could get all paths to either be relative paths or absolute paths, instead of both? Then I'd have a much easier time doing this PR. I feel like I'm having to make a lot of decisions trying to normalize these paths (do the instance variables hold relative or absolute paths?), and realizing it's probably out of scope of this performance improvement PR anyways.

natematykiewicz · 2024-08-08T16:53:08Z

That last comment was about both the @included_patterns and @excluded_patterns instance variables, which in the real world seem to have a mix of absolute and relative paths.

natematykiewicz · 2024-08-20T21:56:35Z

I just saw that #2424 was merged. That should make this PR much simpler.

natematykiewicz · 2024-09-25T19:50:10Z

@vinistock @andyw8 I just force pushed to this branch. I started over from main, and did what Vini said -- exclude top-level directories if the entire tree has been excluded (such as tmp/**/*).

This change actually made this all massively easier. No longer having to consider leading **/ on the excludes was really helpful.

All in all, this PR is much simpler now.

natematykiewicz · 2024-09-25T19:53:17Z

Note, since we're only excluding the very top-level directories, vendor/bundle still gets traversed even though it's my BUNDLE_PATH. As Vini alluded to, things get a lot more complicated if we try skip multiple layers to top-level directories (like vendor/bundle/**/*). I left a comment calling that out as well.

Currently, all folders and files in the current tree are turned into IndexablePath, and then excluded files are filtered out after. When there are large file trees that are meant to be excluded, this results in a lot of unnecessary work. ActiveStorage stores files in the `tmp` directory in many many small folders. So does Bootsnap. Additionally, node_modules can become quite large. Ruby LSP has to traverse all of these files, even though the entire directory should just be ignored. Instead we can skip any top-level directories whose paths have been excluded. We still need to loop through all IndexablePath objects compare them to the exclude_patterns, in case nested folders or file name patterns were excluded. Still, skipping some large top-level directories proves to be a big performance improvement. Before this PR in my Rails app, `indexables` took 76 seconds to run. Now it takes, 0.17 seconds. Before and after code both return the same exact file list.

vinistock

This is looking great

vinistock · 2024-10-04T19:45:26Z

lib/ruby_indexer/lib/ruby_indexer/configuration.rb

+ included_paths = Dir.glob(File.join(@workspace_path, "*/"), flags)
+ .reject { |dir| File.fnmatch?(merged_excluded_file_pattern, dir, flags) }
+ .map do |included_path|
+ relative_path = included_path
+ .delete_prefix(@workspace_path)
+ .tap { |path| path.delete_prefix!("/") }
+ .tap { |path| path.delete_suffix!("/") }
+
+ [included_path, relative_path]
+ end


Two things about this section:

We can avoid looping twice by using filter_map

We should invoke merged_excluded_file_pattern only once. It's currently invoking it once per directory, which is more expensive than necessary

Also, why do we need to delete the / suffix twice?

Suggested change

included_paths = Dir.glob(File.join(@workspace_path, "*/"), flags)

.reject { |dir| File.fnmatch?(merged_excluded_file_pattern, dir, flags) }

.map do |included_path|

relative_path = included_path

.delete_prefix(@workspace_path)

.tap { |path| path.delete_prefix!("/") }

.tap { |path| path.delete_suffix!("/") }

[included_path, relative_path]

end

excluded_pattern = merged_excluded_file_pattern

included_paths = Dir.glob(File.join(@workspace_path, "*/"), flags)

.filter_map do |included_path|

next if File.fnmatch?(excluded_pattern, dir, flags)

relative_path = included_path

.delete_prefix(@workspace_path)

.tap { |path| path.delete_prefix!("/") }

.tap { |path| path.delete_suffix!("/") }

[included_path, relative_path]

end

Oh, right! c16b362 and 050ee96

Also, why do we need to delete the / suffix twice?

It's a prefix and a suffix.

So for example, with a @workspace_path of /my/workspace/path. An included_path might then be /my/workspace/path/test/. I'm trying to make it become test to be a relative path. Deleting the workspace path leaves us with /test, so then I delete the prefix and suffix / to get it down to test. Then I could see whether the included path starts with that. Though, thinking about it more, I think I want the suffix /. We want to make sure that it starts with test/ instead test matching some test_foo path.

Yeah, I just removed that line. We want prefixing / gone, but we want to keep the suffix ones.
14495ad

The only reason we didn't have any tests failing is because we weren't including a path whose top-level directory started with an excluded directory name.

lib/ruby_indexer/lib/ruby_indexer/configuration.rb

vinistock · 2024-10-04T19:51:57Z

lib/ruby_indexer/lib/ruby_indexer/configuration.rb

+
+ @excluded_patterns = T.let(
+ [
+ File.join("**", "*_test.rb"),


Let's take advantage of the change and exclude the test and spec directories completely.

Suggested change

File.join("**", "*_test.rb"),

File.join("test", "**", "*"),

File.join("spec", "**", "*"),

This causes a test to fail, because of files like lib/ruby_indexer/test/configuration_test.rb

249167b

This is no longer true:
https://github.com/natematykiewicz/ruby-lsp/blob/050ee96a468cab64651f43bdd935fa3addafd8b6/lib/ruby_indexer/test/configuration_test.rb#L22

I re-added **/*_test.rb

1bf7699

Should we exclude sorbet/**/* while we're at it? At my job, sorbet directory is 1331 files. tmp is 146k files, which is why this change helps so much. Our spec directory is 558 files.

natematykiewicz · 2024-10-08T17:17:22Z

@vinistock just so you know, I think I've handled all of your feedback

natematykiewicz requested a review from a team as a code owner May 22, 2024 21:55

natematykiewicz requested review from andyw8 and vinistock May 22, 2024 21:55

natematykiewicz mentioned this pull request May 22, 2024

Prevent test suite from modifying contributors' Bundler settings #2083

Merged

natematykiewicz force-pushed the improve_indexables_performance branch 2 times, most recently from fabdd05 to 460e046 Compare May 23, 2024 17:13

andyw8 added enhancement New feature or request server This pull request should be included in the server gem's release notes labels May 23, 2024

vinistock reviewed May 29, 2024

View reviewed changes

vinistock reviewed Jun 7, 2024

View reviewed changes

natematykiewicz force-pushed the improve_indexables_performance branch from c104c96 to 1651c8a Compare June 8, 2024 04:56

natematykiewicz commented Jun 8, 2024

View reviewed changes

andyw8 removed their request for review August 8, 2024 16:07

andyw8 marked this pull request as draft August 8, 2024 16:09

natematykiewicz force-pushed the improve_indexables_performance branch from 3a8c7fe to 4279f64 Compare September 25, 2024 19:46

natematykiewicz marked this pull request as ready for review September 25, 2024 19:47

natematykiewicz force-pushed the improve_indexables_performance branch from 4279f64 to 804afb0 Compare September 25, 2024 19:54

natematykiewicz force-pushed the improve_indexables_performance branch from 804afb0 to 39ca825 Compare September 25, 2024 19:56

vinistock reviewed Oct 4, 2024

View reviewed changes

natematykiewicz added 5 commits October 7, 2024 16:33

Avoid unnecessary iterations with filter_map

c16b362

Exclude spec and test directories entirely

249167b

Avoid computing the same value N times

050ee96

Remove unnecessary suffix delete

14495ad

Re-exclude **/*_test.rb. They're not inside test/**/*.

1bf7699

		patterns = indexable_included_file_patterns(Dir.pwd, merge_exclude_patterns).map! { \|dir\| File.join(dir, "*.rb") }
		patterns = [File.join(Dir.pwd, "*/.rb")] if patterns.empty?

	File.join("*", "_test.rb"),
	File.join("test", "*", ""),
	File.join("spec", "*", ""),

Improve performance of finding indexables #2082

Are you sure you want to change the base?

Improve performance of finding indexables #2082

Conversation

natematykiewicz commented May 22, 2024 • edited Loading

Motivation

Implementation

Automated Tests

Manual Tests

natematykiewicz commented May 23, 2024

vinistock left a comment

Choose a reason for hiding this comment

natematykiewicz commented May 31, 2024

natematykiewicz commented Jun 6, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

natematykiewicz Jun 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vinistock commented Jul 10, 2024

natematykiewicz commented Jul 10, 2024

natematykiewicz commented Jul 16, 2024

natematykiewicz commented Jul 16, 2024

vinistock commented Jul 17, 2024

natematykiewicz commented Jul 17, 2024

andyw8 commented Aug 8, 2024

natematykiewicz commented Aug 8, 2024

natematykiewicz commented Aug 8, 2024 • edited Loading

natematykiewicz commented Aug 20, 2024

natematykiewicz commented Sep 25, 2024

natematykiewicz commented Sep 25, 2024

vinistock left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

natematykiewicz commented Oct 8, 2024

natematykiewicz commented May 22, 2024 •

edited

Loading

natematykiewicz Jun 10, 2024 •

edited

Loading

natematykiewicz commented Aug 8, 2024 •

edited

Loading