-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] thinLTO for SYCL #15083
base: sycl
Are you sure you want to change the base?
[RFC] thinLTO for SYCL #15083
Changes from all commits
817a0c9
70ee0e0
00a28c1
d3ccfd1
8c4edb3
3322684
43df8c9
9922d30
1deaea2
c68e797
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,6 @@ | ||
set(LLVM_LINK_COMPONENTS | ||
${LLVM_TARGETS_TO_BUILD} | ||
BitReader | ||
BitWriter | ||
Core | ||
BinaryFormat | ||
|
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -114,9 +114,4 @@ DEVICE_EXTERN_C void __devicelib_assert_fail(const char *expr, const char *file, | |
__assertfail(expr, file, line, func, 1); | ||
} | ||
|
||
DEVICE_EXTERN_C void _wassert(const char *_Message, const char *_File, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I feel like this is a change that can be merged and submitted separately. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't even know if it's correct, I just hit a build error on windows about |
||
unsigned _Line) { | ||
__assertfail(_Message, _File, _Line, 0, 1); | ||
} | ||
|
||
#endif |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
//===-- SYCLLinkedModuleProcessor.h - finalize a fully linked module ---===// | ||
// | ||
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. | ||
// See https://llvm.org/LICENSE.txt for license information. | ||
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
// | ||
//===----------------------------------------------------------------------===// | ||
// | ||
// The file contains a number of functions to create a pass that can be called | ||
// by the LTO backend that will finalize a fully-linked module. | ||
//===----------------------------------------------------------------------===// | ||
#pragma once | ||
#include "SpecConstants.h" | ||
namespace llvm { | ||
|
||
class PassRegistry; | ||
class ModulePass; | ||
ModulePass * | ||
createSYCLLinkedModuleProcessorPass(llvm::SpecConstantsPass::HandlingMode); | ||
void initializeSYCLLinkedModuleProcessorPass(PassRegistry &); | ||
|
||
} // namespace llvm |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
//===-- SYCLLinkedModuleProcessor.cpp - finalize a fully linked module ---===// | ||
// | ||
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. | ||
// See https://llvm.org/LICENSE.txt for license information. | ||
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
// | ||
//===----------------------------------------------------------------------===// | ||
// See comments in the header. | ||
//===----------------------------------------------------------------------===// | ||
|
||
#include "llvm/SYCLLowerIR/SYCLLinkedModuleProcessor.h" | ||
|
||
#include "llvm/Pass.h" | ||
|
||
#define DEBUG_TYPE "sycl-linked-module-processor" | ||
using namespace llvm; | ||
|
||
namespace { | ||
class SYCLLinkedModuleProcessor : public ModulePass { | ||
public: | ||
static char ID; | ||
SYCLLinkedModuleProcessor(SpecConstantsPass::HandlingMode Mode) | ||
: ModulePass(ID), Mode(Mode) { | ||
initializeSYCLLinkedModuleProcessorPass(*PassRegistry::getPassRegistry()); | ||
} | ||
|
||
bool runOnModule(Module &M) override { | ||
// TODO: determine if we need to run other passes | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If I understand correctly, that's an equivalent of what's being run by
If we also taking about what happens after
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So when we do early splitting in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I believe that most of E2E are single-file tests with no There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suggest adding tests similar to |
||
ModuleAnalysisManager MAM; | ||
SpecConstantsPass SCP(Mode); | ||
auto PA = SCP.run(M, MAM); | ||
return !PA.areAllPreserved(); | ||
} | ||
|
||
private: | ||
SpecConstantsPass::HandlingMode Mode; | ||
}; | ||
} // namespace | ||
char SYCLLinkedModuleProcessor::ID = 0; | ||
INITIALIZE_PASS(SYCLLinkedModuleProcessor, "SYCLLinkedModuleProcessor", | ||
"Finalize a fully linked SYCL module", false, false) | ||
ModulePass *llvm::createSYCLLinkedModuleProcessorPass( | ||
SpecConstantsPass::HandlingMode Mode) { | ||
return new SYCLLinkedModuleProcessor(Mode); | ||
} |
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,149 @@ | ||||||
# ThinLTO for SYCL | ||||||
|
||||||
This document describes the purpose and design of ThinLTO for SYCL. | ||||||
|
||||||
**NOTE**: This is not the final version. The document is still in progress. | ||||||
|
||||||
## Background | ||||||
|
||||||
With traditional SYCL device code linking, all user code is linked together | ||||||
along with device libraries into a single huge module and then split and | ||||||
processed by `sycl-post-link`. This requires sequential processing, has a large | ||||||
memory footprint, and differs from the linking flow for AMD and NVIDIA devices. | ||||||
|
||||||
## Summary | ||||||
|
||||||
SYCL ThinLTO will hook into the existing community mechanism to run LTO as part | ||||||
of device linking inside `clang-linker-wrapper`. We split the device images | ||||||
early at compilation time, and at link time we use ThinLTO's function importing | ||||||
feature to bring in the definitions for referenced functions. Only the new | ||||||
offload model is supported. | ||||||
|
||||||
## Device code compilation time changes | ||||||
|
||||||
Most of the changes for ThinLTO occur during device link time, however there is | ||||||
one major change during compilation (-c) time: we now run device code split | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
during compilation instead of linking. The main reason for doing this is | ||||||
increased parallelization. Many compilation jobs can be run at the same time, | ||||||
but linking happens once total for the application. Device code split is | ||||||
currently a common source of performance issues. | ||||||
|
||||||
Splitting early means that the resulting IR after splitting is not complete, it | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think that early split leads to incomplete modules. Lack of linking leads to incomplete modules, i.e. a single translation unit with a call to undefined If it was complete, then current version of the splitting mechanism won't break it, i.e. every resulting module would still be complete (at the cost of code duplication). However, I'm not 100% about the latter, if shared libraries are involved, that behavior definitely changes. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sorry, you're right, let me reword it |
||||||
still may contain calls to functions (user code and/or the SYCL device | ||||||
libraries) defined in other translation units. | ||||||
|
||||||
We rely on the assumption that all function definitions matching a declaration | ||||||
will be the same and we can let ThinLTO pull in any one. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. C++ one definition rule guarantees this property of the code, doesn't it? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it depends what the original IR linkage is. if the original IR is |
||||||
|
||||||
For example, let's start with user device code that defines a `SYCL_EXTERNAL` | ||||||
function `foo` in translation unit `tu_foo`. There is also another translation | ||||||
unit `tu_bar` that references `foo`. During the early device code splitting run | ||||||
of `tu_foo`, we may find that more than one of the resultant device images | ||||||
contain a definition for `foo`. | ||||||
Comment on lines
+40
to
+42
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we clarify the reason why Also, we have a flag (originally introduced to support shared libraries) which allows to disable cloning of functions which are shared between device images produced by code splitting. We could probably use that mode for thinLTO to avoid unnecessary duplicates. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. will add this, thanks |
||||||
|
||||||
We assert that any function definition for `foo` that is deemed a match by the | ||||||
ThinLTO infrastructure during the processing of `tu_bar` is valid. | ||||||
|
||||||
As a result of running early device code split, the fat object file generated as | ||||||
part of device compilation may contain multiple device code images. | ||||||
|
||||||
## Device code link time changes | ||||||
|
||||||
Before we go into the link time changes for SYCL, let's understand the device | ||||||
linking flow for AMD/NVIDIA devices: | ||||||
Comment on lines
+52
to
+53
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I read this phrase as SYCL linking flow for AMD/NVIDIA devices, but if that reading is correct, then requirement (2) looks weird, because it is not dependent on a device. Perhaps this phrase was referring to upstream handling of OpenMP offloading? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah it's for omp, will reword. thanks |
||||||
|
||||||
![Community linking flow](images/ThinLTOCommunityFlow.svg) | ||||||
|
||||||
SYCL has two differentiating requirements: | ||||||
|
||||||
1) The SPIR-V backend is not production ready and the SPIR-V translator is used. | ||||||
2) The SYCL runtime requires metadata (module properties and module symbol | ||||||
table) computed from device images that will be stored along the device images | ||||||
in the fat executable. | ||||||
|
||||||
The effect of requirement 1) is that instead of letting ThinLTO call the SPIR-V | ||||||
backend, we add a callback that runs right before CodeGen would run. In that | ||||||
callback, we call the SPIR-V translator and store the resultant file path for | ||||||
use later, and we instruct the ThinLTO framework to not perform CodeGen. | ||||||
Comment on lines
+66
to
+67
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note: we should be able to call the translator as a library and store the final binary (in form of a |
||||||
|
||||||
An interesting additional fact about requirement 2) is that we actually need to | ||||||
process fully linked module to accurate compute the module properties. One | ||||||
example where we need the full module is to [compute the required devicelib | ||||||
mask](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLLowerIR/SYCLDeviceLibReqMask.cpp). | ||||||
If we only process the device code that was included in the original fat object | ||||||
input to `clang-linker-wrapper`, we will miss devicelib calls in referenced | ||||||
`SYCL_EXTERNAL` functions. | ||||||
|
||||||
The effect of requirement 2) is that we store the fully linked device image for | ||||||
metadata computation in the SYCL-specific handing code after the ThinLTO | ||||||
framework has completed. Another option would be to try to compute the metadata | ||||||
inside the ThinLTO framework callbacks, but this would require SYCL-specific | ||||||
arguments to many caller functions in the stack and pollute community code. | ||||||
|
||||||
Here is the current ThinLTO flow for SYCL: | ||||||
|
||||||
![SYCL linking flow](images/ThinLTOSYCLFlow.svg) | ||||||
|
||||||
We add a `PreCodeGenModuleHook` function to the `LTOConfig` object so that we | ||||||
can process the fully linked module without running the backend. | ||||||
|
||||||
However, the flow is not ideal for many reasons: | ||||||
|
||||||
1) We are relying on the external `llvm-spirv` tool instead of the SPIR-V | ||||||
backend. We could slightly improve this issue by using a library call to the | ||||||
SPIR-V translator instead of the tool, however the library API requires setting | ||||||
up an object to represent the arguments while we only have strings, and it's | ||||||
non-trivial to parse the strings to figure out how to create the argument | ||||||
Comment on lines
+94
to
+96
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Am I right that you are talking about extensions here? I don't think that it is necessarily non-trivial. If what we get from clang driver is what we should pass to https://github.com/KhronosGroup/SPIRV-LLVM-Translator/blob/main/tools/llvm-spirv/llvm-spirv.cpp#L516 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it's not just that, if it were the extensions it would be not that bad. let's consider the flag In If I missed some obvious easy way of doing it let me know There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Do we have a public I guess my high-level point is: do we really have to pass all those flags through the linker-wrapper, or can we just come up with their values within clang-linker wrapper based on the module itself? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, The consensuses from other reviewers here is to just use the SPIR-V backend now since it is apparently pretty good, so I'm not planning to use the translator anymore. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You seem to be talking about how to initialize globals within There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, you're right, I'm too tried. Still, having to parse the strings from the driver and map it to the getter/setter for the object seems not ideal. |
||||||
object. Since we plan to use the SPIR-V backend in the long term, this does not | ||||||
seem to be worth the effort. | ||||||
|
||||||
2) We manually run passes inside `PreCodeGenModuleHook`. This is because we | ||||||
don't run CodeGen, so we can't take advantage of the `PreCodeGenPassesHook` | ||||||
field of `LTOConfig` to run some custom passes, as those passes are only run | ||||||
when we actually are going to run CodeGen. | ||||||
|
||||||
3) We have to store the fully linked module. This is needed because we need a | ||||||
fully linked module to accurately compute metadata, see the above explanation of | ||||||
SYCL requirement 2). We could get around storing the module by computing the | ||||||
metadata inside the LTO framework and storing it for late use by the SYCL | ||||||
bundling code, but doing this would require even more SYCL-only customizations | ||||||
including even more new function arguments and modifications of the | ||||||
`OffloadFile` class. There are also compilations because the LTO framework is | ||||||
Comment on lines
+109
to
+111
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why would it require modifications to |
||||||
multithreaded, and not all LLVM data structures are thread safe. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is not clear to me what is there that requires multithreading when each thinLTO thread should be working on its own resulting module alone. Or am I missing something? |
||||||
|
||||||
The proposed long-term SYCL ThinLTO flow is as follows: | ||||||
|
||||||
![SYCL SPIR-V backend linking flow](images/ThinLTOSYCLSPIRVBackendFlow.svg) | ||||||
|
||||||
The biggest difference here is that we are running CodeGen using the SPIR-V | ||||||
backend. | ||||||
|
||||||
Also, instead of using a lambda function in the `PreCodeGenModuleHook` callback | ||||||
to run SYCL finalization passes, we can take advantage of the | ||||||
`PreCodeGenPassesHook` field to add passes to the pass manager that the LTO | ||||||
framework will run. | ||||||
Comment on lines
+121
to
+124
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I got confused by this paragraph, because it contradicts some previous statement. Does it make sense to create sub-sections for short- and long-term designs of thinLTO to better split those two descriptions? |
||||||
|
||||||
It is possible that the number of device images in the fat executable and which | ||||||
device image contains which kernel is different with ThinLTO enabled, but we do | ||||||
expect this to have any impact on correctness or performance, nor we do expect | ||||||
Comment on lines
+127
to
+128
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "but we do expect this..." - missing "not"? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yep, thx |
||||||
users to care. | ||||||
|
||||||
## Current limitations | ||||||
|
||||||
`-O0`: Compiling with `-O0` prevent clang from generating ThinLTO metadata | ||||||
during the compilation phase. In the current implementation, this is an error. | ||||||
In the final version, we could either silently fall back to full LTO or generate | ||||||
ThinLTO metadata even for `-O0`. | ||||||
|
||||||
SYCL libdevice: Current all `libdevice` functions are explicitly marked to be | ||||||
weak symbols. The ThinLTO framework does not consider a definition of function | ||||||
with weak linkage as it cannot be sure that this definition is the correct one. | ||||||
Ideally we could remove the weak symbol annotation. | ||||||
|
||||||
No binary linkage: The SPIR-V target does not currently have a production | ||||||
quality binary linker. This means that we must generate a fully linked image as | ||||||
part of device linkage. At least for AMD devices, this is not a requirement as | ||||||
Comment on lines
+143
to
+145
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. SPIR-V may be incomplete, there is nothing in the standard which disallows that. However, producing incomplete SPIR-V modules in the compiler will require SYCL RT changes so that SYCL RT can pass a set of SPIR-V modules into device JIT compiler which when linked together produce a complete module. |
||||||
`lld` is used for the final link which can resolve any unresolved symbols. | ||||||
`-fno-gpu-rdc` is default for AMD, so in that case it can call `lld` during | ||||||
compile, but if `-fno-gpu-rdc` is passed, the lld call happens as part of | ||||||
`clang-linker-wrapper` to resolve any symbols not resolved by ThinLTO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required to get libdevice functions linked in by the thinLTO function importing infrastructure, see here. I'm looking for a better solution for this, I just kept this here in case anybody plans on trying the prototype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose importing devicelib symbols at compile step can be a solution (see #15114).
On the other hand, I recall discussing the possibility of linking device libraries with upstream maintainers, who expressed a preference for shifting device library linking from the "compile" to the "link" step. It would be ideal if we could discover a solution that aligns with the long-term strategy of upstream and enables us to utilize the thinLTO framework for offload code linking, thereby avoiding the use of weak symbols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
attention to @mdtoguchi who has been looking at importing devicelib at compile step from the SYCL perspective.
Point to note: During one of the LLVM community presentation, it was mentioned that they are trying to move importing devicelib to link time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we already perform device library linking at link time we can consider abandoning the efforts to pull them into the compilation step. My main concern with performing at the link step is the communication required from the driver to the
clang-linker-wrapper
informing which device libraries should be linked. The less tie-in we have between the driver and theclang-linker-wrapper
at link time, the better. IMO, at the very least the linker wrapper should know a minimum default device libraries to link and any communication from the driver is manipulating that list.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @mdtoguchi. Unless user wants to change the names/location or disable linking of device libraries, driver should not have any logic to handle device code linking other than invoking
clang-linker-wrapper
. It makes sense to have driver options for additional configuration of device libraries, but driver's implementation should be just passing corresponding values toclang-linker-wrapper
where these options should be processed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is very interesting.
While working on #15114 I've been wondering whether there is a particular reason why we link against CUDA libdevice and libclc in the compile step, but also again in the link step.
Could I get some clarification on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Naghasan, @npmiller, are you able to help here?