📎 Multi-file support #3307
Thanks for this outline! In general I agree with the two-pass approach. I think indeed there should be one pass that collects the data we need and one that does analysis with the data available across all files. As you mentioned, between the passes there may also be a step that detects and defines dependencies: the creation of the `Project` data structure. One part made me cautious, though:
I am somewhat afraid this approaches the problem from the wrong angle. Rather than querying from the place where the data is needed and caching results, I think it's better to build (most of) the data from the ground up during the first phase, so that it's already available when the second phase starts. This does mean we end up computing some information we never query, but I see some convincing reasons we may want to take this approach anyway:
And finally, if the goal is to avoid doing unnecessary work, I think it's best to split the types of data we gather into various services, and only activate the services that have one or more rules registered for them. But if we make data gathering itself too fine-grained, I think we'll make things too complex for ourselves, and I'm skeptical the work we save would outweigh the benefits of better parallelization. Unrelated question: since the data structure is called `Project` for now, do you think this provides a good opportunity to improve our monorepo support as well?
So, a little more concretely, I would suggest something in the following direction:

Initialization step: Config Resolution

This is very much like today: we load the config files. Since we decide here which rules should be enabled, we'll know which services to activate for the next step.

Phase 1: Traversal and Discovery

This step performs the filesystem traversal and parses all the supported files. It also extracts file-specific data for the enabled services. Examples of services could be:
I included project structure in this list because I think if we do this right, it might help us fix our monorepo support. This probably means we'll treat […]

Intermediary step: Project Resolution

At this point we look at everything we've gathered and build a dependency tree. We create one or more […]. This step is probably hard to parallelize, since graph manipulation is difficult to multi-thread, which is why I think we should have most of the data available already. File resolution also happens here, and I think we can be clever by not doing any additional filesystem I/O: we've already done a traversal, so we know the file paths and their contents.

Phase 1b: Traversal and Discovery of Dependencies

Some services, such as type inference, may want to extract information from external packages. We couldn't do that during phase 1, since we hadn't done any file resolution yet. But now that we have identified what is missing, we can repeat phase 1 for these resources. The main difference is that it won't be a filesystem traversal, but a traversal of the dependency graph for missing data. It may take filesystem access, or even network access, to fully resolve everything, though.

Phase 2: Analysis

Now we have well-defined projects, their dependencies, full file resolutions and a bunch of extracted metadata. It's time to run our rules :) This can be fully parallelized again, since the data in the services is fixed at this point.
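To make the service/phase split a bit more tangible, here is a minimal sketch. All names (`Service`, `extract`, `analyze`, `active_services`) are hypothetical and invented for illustration; nothing here exists in Biome today.

```rust
use std::path::Path;

// Hypothetical sketch of the service-based split described above.
struct Project;
struct Diagnostic;

/// A service gathers file-specific data during phase 1 and is only
/// queried during phase 2, once the `Project` has been resolved.
trait Service {
    /// Phase 1: extract whatever this service needs from a single
    /// parsed file (imports, exported symbols, project structure, ...).
    fn extract(&mut self, path: &Path, source: &str);

    /// Phase 2: run analysis with project-wide data available; the
    /// data gathered during phase 1 is fixed at this point.
    fn analyze(&self, project: &Project) -> Vec<Diagnostic>;
}

/// Only the services that have one or more enabled rules registered
/// for them would be activated after config resolution.
fn active_services(_enabled_rules: &[&str]) -> Vec<Box<dyn Service>> {
    Vec::new() // stub: real selection logic would live here
}
```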
One more thing I'd like to add: I think an approach like this could scale very well to our LSP use case too, since every time a file is changed, we just re-run the phase 1 discovery on it, after which we may only need to update the […]
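As a rough sketch of that incremental LSP flow (all names invented for illustration; this is not an existing Biome API):

```rust
use std::path::Path;

// Hypothetical sketch: re-run phase 1 discovery for a single changed
// file and patch the project data afterwards.
struct Project;
struct FileData;

fn run_phase1_discovery(_path: &Path, _source: &str) -> FileData {
    FileData
}

fn update_project(_project: &mut Project, _path: &Path, _data: FileData) {
    // Replace this file's node in the dependency graph and update any
    // edges whose imports changed.
}

fn on_file_changed(project: &mut Project, path: &Path, new_source: &str) {
    let data = run_phase1_discovery(path, new_source);
    update_project(project, path, data);
    // Phase 2 analysis can now re-run for the affected files only.
}
```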
There's also one more thing I would like to consider as part of this refactor/feature. In case you don't know, I have been looking at […]. Their website lists the documentation of the upcoming v3, and the old website is now gone. Another word of advice: the documentation is a bit lacking, especially when it comes to how to use […]
One thing I am afraid of is unnecessary traversal of directories (including symbolic links). We could take a more granular approach, where we load the files in the currently traversed directory and decide, based on […], whether to continue traversing. This has one drawback: a handled file may import a file in an ignored […]. Note that we need to clarify one thing: if a user globally includes directories, we should only process those. However, if a user doesn't use […]. Using this approach will require us to slightly refine or change the semantics of our […]
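A rough sketch of such a per-directory decision (all names invented; the actual include/ignore semantics are exactly what would need to be clarified):

```rust
use std::path::Path;

// Hypothetical sketch of a more granular traversal: decide per entry,
// based on the resolved configuration, whether to process it or
// descend further, so ignored subtrees are skipped early.
struct ResolvedConfig;

fn should_process(_config: &ResolvedConfig, _path: &Path) -> bool {
    true // stub: include/ignore matching would happen here
}

fn traverse(config: &ResolvedConfig, dir: &Path) {
    let Ok(entries) = std::fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if !should_process(config, &path) {
            continue; // skip ignored files and whole subtrees early
        }
        if path.is_dir() {
            traverse(config, &path);
        } else {
            // handle the file (parse, extract data for services, ...)
        }
    }
}

fn main() {
    traverse(&ResolvedConfig, Path::new("."));
}
```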
It is not clear to me why a module graph is needed for multi-file analysis.
If we keep our query-based linter system, most of the current lint rules could be executed in the first phase, because they visit the AST. Is there any reason to defer them until after the dependencies have been resolved?
I think […]

Another thing to consider: as our plugin efforts progress, we may want to give plugins the ability to extract information from files, similar to how Rust services can. This might be possible with […]
(emphasis mine)
I think indeed processing files outside the traversed hierarchy is what we want; we may even need to extract information from dependencies. I would say this is true even if the user only specified a limited set of […]
I think it's more a convenience than a necessity. Indeed, as soon as you extract imports and are able to resolve them, you already have the graph implicitly. Putting it in a data structure just makes access more convenient. Cycle detection is indeed a good example of that :) I wouldn't say it necessarily needs to be a […]
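For instance, here is a minimal sketch of how convenient cycle detection becomes once the graph is explicit. It assumes `petgraph` as the backing structure purely for illustration, which the comment above deliberately leaves open.

```rust
use petgraph::algo::is_cyclic_directed;
use petgraph::graph::DiGraph;

fn main() {
    // Minimal sketch: modules as nodes, resolved imports as edges.
    let mut graph = DiGraph::<&str, ()>::new();
    let a = graph.add_node("a.js");
    let b = graph.add_node("b.js");
    let c = graph.add_node("c.js");
    graph.add_edge(a, b, ()); // a.js imports b.js
    graph.add_edge(b, c, ()); // b.js imports c.js
    graph.add_edge(c, a, ()); // c.js imports a.js -> cycle

    // Once the graph is explicit, cycle detection is a one-liner.
    assert!(is_cyclic_directed(&graph));
}
```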
You're right, they could be. The main reason I figured it might be better to do all analysis in phase 2 was simplicity: if all analysis rules run in the same phase, there might be less bookkeeping involved. Maybe your suggestion could actually improve performance by giving the workers more to do while we also perform a lot of filesystem I/O, so I wouldn't rule out that option either.
We were discussing this on Discord, but adding it here for visibility: I think it could be a good idea to build an initial version without a dependency graph and use it to power the GraphQL analyzer. It might be a good PoC for a first iteration.
Update: I've been doing some more exploration of how we can implement some of the things described here. This has helped me realize some of the difficulties that lie ahead of us, so I'll share my insights here:
For those curious, my current branch can be explored here, but it's in a bit of a messy state, since I also took some wrong turns before learning the lessons I shared here 😅 https://github.com/arendjr/biome/tree/dependency-graph
Preface
We want to implement a feature that allows the various features of Biome to tap into information regarding other files. At the moment, features like lint rules don't have access to information that belongs to other files. Some examples are exported bindings, exported types, types of modules, etc.
An entity we could call `Project` (the name is not set in stone) should provide this kind of information.

Design
Below, I will list some design aspects of the feature, which we can discuss. Don't take everything as set in stone: absorb it first, then debate it. I will do my best to outline the current architecture as clearly as I can.
First, the architecture should follow this principle:
Playing actors
At the moment, the following actors are at play:
- The `FileSystem` trait: a generic trait meant to crawl the file system and evaluate whether certain paths (directories, files, symbolic links, etc.) should be checked. Two types implement this trait: `OsFileSystem` and `MemoryFileSystem`. The `FileSystem` doesn't know which files should be handled; the logic that checks whether those paths can be handled is implemented elsewhere, via another trait called `TraversalContext`. As you can see, this responsibility lives elsewhere. `OsFileSystem`, however, has an interner for paths, but it isn't related to the paths being checked.
- The `WorkspaceServer`: it can be seen as our query compiler. The end consumer (e.g. CLI or LSP) has control over the compiler and tells it what to do. The `WorkspaceServer` has some characteristics: `ProjectData` is already in use, and it knows that a client might have different projects with different settings (see biome/crates/biome_service/src/settings.rs, lines 52 to 57 at eafc62e).

The `FileSystem` and the `WorkspaceServer` rarely interact with each other. The only moment when they do is for a single operation: `fs.open_file` -> `workspace.open_file`.
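For reference, here is a deliberately simplified sketch of how that split of responsibilities could look. These are not the actual trait definitions in Biome, just an illustration of the division described above.

```rust
use std::path::Path;

/// Simplified sketch: knows how to crawl a file system (real or
/// in-memory), but not which paths are relevant.
trait FileSystem {
    fn read_file(&self, path: &Path) -> std::io::Result<String>;
    fn traverse(&self, root: &Path, ctx: &dyn TraversalContext);
}

/// Simplified sketch: decides whether a given path should be handled
/// at all; this is where include/ignore logic lives, not in the
/// `FileSystem` itself.
trait TraversalContext {
    fn can_handle(&self, path: &Path) -> bool;
    fn handle_file(&self, path: &Path);
}
```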
Possible data flow
I envision the `Project` to be part of the `WorkspaceServer`. If so, `Project` must be `Send`, `Sync` and `RefUnwindSafe`. This means that we need to choose data that meets those criteria. Fortunately, a module graph can be built with `petgraph`, which implements these traits.

A `Project` shouldn't duplicate the information that is already present in the `WorkspaceServer`, e.g. the file content or the CST.
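As a minimal sketch, assuming `petgraph` as a dependency and with entirely hypothetical field names, such a `Project` could hold a module graph keyed by paths only (contents and CSTs stay in the `WorkspaceServer`), and we can assert the trait requirements at compile time:

```rust
use std::panic::RefUnwindSafe;
use std::path::PathBuf;

use petgraph::graph::DiGraph;

/// Hypothetical sketch: nodes are files, edges are import relations.
/// Only paths and lightweight metadata are stored here.
struct Project {
    module_graph: DiGraph<PathBuf, ImportKind>,
}

enum ImportKind {
    Static,
    Dynamic,
}

// Compile-time assertion that `Project` meets the criteria required
// to live inside the `WorkspaceServer`.
fn assert_workspace_compatible<T: Send + Sync + RefUnwindSafe>() {}

fn main() {
    assert_workspace_compatible::<Project>();
}
```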
Where we should strive
Here are some ideas that we should consider while implementing the architecture:
CLI
From a CLI perspective, a way to build a `Project` and its graph is by doing two passes over the file system: one pass to collect all files, and a second pass to check the files. After the first pass it should be possible to build the `Project` we require.

The second pass is more challenging because, between the first pass and the second pass, we don't have an entity that will tell the `WorkspaceServer` which files should be checked. Should we create one? Should we extend `FileSystem` to tell us? Let's discuss it.
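A rough sketch of that two-pass flow, with hypothetical names (none of these types or functions exist in Biome today):

```rust
use std::path::{Path, PathBuf};

// Hypothetical sketch of the two-pass CLI flow.
struct Project {
    files: Vec<PathBuf>,
}

/// First pass: walk the file system and collect every supported file.
fn collect_supported_files(dir: &Path, files: &mut Vec<PathBuf>) {
    let Ok(entries) = std::fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            collect_supported_files(&path, files);
        } else if path.extension().is_some_and(|ext| ext == "js" || ext == "ts") {
            files.push(path);
        }
    }
}

/// Second pass: check each file, now with project-wide data available.
fn check_file(_project: &Project, path: &Path) {
    println!("checking {}", path.display());
}

fn main() {
    let mut files = Vec::new();
    collect_supported_files(Path::new("."), &mut files);
    // Between the passes we build the `Project` from what we collected.
    let project = Project { files: files.clone() };
    for file in &files {
        check_file(&project, file);
    }
}
```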
LSP
From the LSP perspective, it's easier to initialise a new `Project`, but it will be more challenging to update it. Every time a file is updated, created or deleted, the graph should be updated.

References
Here, Jeroen Engels, creator of `elm-review`, explains how he does multi-file analysis inside the linter: https://jfmengels.net/multi-file-analysis/. However, let's take into consideration that we have different constraints and requirements.
Tasks
I haven't talked about dependencies. Dependencies will be part of the graph, but they can be added later on, as we need a manifest to collect them (which we already have).
- `Project` inside `biome_project`
(Feel free to add more tasks if you think they are needed)