
[RFC] thinLTO for SYCL #15083

Open · wants to merge 10 commits into base: sycl

Conversation

@sarnex (Contributor) commented Aug 14, 2024

Hi all,

I'd like to get some architecture and code-organization level comments on my SYCL thinLTO prototype.

The feature isn't ready for code-review, so I'd appreciate keeping the discussion somewhat high level.

Firstly, if someone wants to try this, you can just clone my branch, build the compiler with --cmake-opt="-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=SPIRV", and then compile SYCL programs with --offload-new-driver -foffload-lto=thin.

The main idea of the change is as follows:

We hook into the existing LTO framework in clang-linker-wrapper by adding a callback that runs right before the backend runs (so right before the SPIR-V backend would be called).

Before the callback is run, function importing happens automatically inside the thinLTO framework where we link in the definitions for any functions referenced in this module but defined in another. When the callback is run, the module passed in should have everything linked in and be ready for processing.

In that callback, we do all the required processing of the SYCL module. In the current prototype, that includes:
  • Generating the module properties
  • Generating the module symbol table
  • Calling the SPIR-V translator

In the future, we will also need to handle at least spec constants and internalization of non-entry points.

The final step of the callback, calling the SPIR-V translator, should be removed when the SPIR-V backend is ready.

One problem is that we need to store the module properties, symbol table, and SPIR-V translator output file path until a later point when we are ready to create the final images, which requires all of this information.

We are also somewhat limited in how we can store the data, because the callbacks run on threads inside the LTO framework, and some data structures (like StringSaver) are not thread-safe. We need to compute this information on the LTO framework thread because that is the only point where we have the IR with functions imported.

As a low-effort hack to get the prototype working, I'm just storing that information as std::string, which is pretty inefficient and gross.

I decided to store the information inside the OffloadBinary. Otherwise, we'd need to create and pass a SYCL-specific data structure all over the place, which seems bad. The current change to OffloadBinary, adding a temporary string vector, is definitely ugly. I'm proposing we make some change to the class that lets us store the information we need, but I don't have a good idea for a clean and thread-safe data structure to use, or for a better place to store all this information.

In terms of test results:

Here are the results when running E2E tests on PVC with --offload-new-driver only enabled:

Total Discovered Tests: 2220
  Unsupported      :  543 (24.46%)
  Passed           : 1553 (69.95%)
  Expectedly Failed:   52 (2.34%)
  Failed           :   72 (3.24%)

And here are the results when running E2E tests on PVC with --offload-new-driver -foffload-lto=thin, so enabling thinLTO:

Total Discovered Tests: 2220
  Unsupported      :  543 (24.46%)
  Passed           : 1466 (66.04%)
  Expectedly Failed:   52 (2.34%)
  Timed Out        :    1 (0.05%)
  Failed           :  158 (7.12%)

It seems we have a decent pass rate, with only 87 thinLTO-specific E2E failures. I'm holding off on investigating these until we finalize the architecture.
Once it's finalized, I'll make a design document to be submitted to this repository.

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Comment on lines +20 to 23
#define DEVICE_EXTERNAL SYCL_EXTERNAL
#else // __SYCL_DEVICE_ONLY__
#define DEVICE_EXTERNAL __attribute__((weak))
#define DEVICE_EXTERNAL
#endif // __SYCL_DEVICE_ONLY__
@sarnex (Contributor, Author) commented Aug 15, 2024

This is required to get libdevice functions linked in by the thinLTO function importing infrastructure, see here. I'm looking for a better solution; I just kept this here in case anybody plans on trying the prototype.

Contributor:

I suppose importing devicelib symbols at the compile step could be a solution (see #15114).

On the other hand, I recall discussing the possibility of linking device libraries with upstream maintainers, who expressed a preference for shifting device library linking from the "compile" to the "link" step. It would be ideal if we could discover a solution that aligns with the long-term strategy of upstream and enables us to utilize the thinLTO framework for offload code linking, thereby avoiding the use of weak symbols.

Contributor:

Attention to @mdtoguchi, who has been looking at importing devicelib at the compile step from the SYCL perspective.
Point to note: during one of the LLVM community presentations, it was mentioned that they are trying to move devicelib importing to link time.

Contributor:

As we already perform device library linking at link time, we can consider abandoning the efforts to pull it into the compilation step. My main concern with performing it at the link step is the communication required from the driver to the clang-linker-wrapper about which device libraries should be linked. The less tie-in we have between the driver and the clang-linker-wrapper at link time, the better. IMO, at the very least the linker wrapper should know a minimal default set of device libraries to link, and any communication from the driver should just manipulate that list.

Contributor:

I agree with @mdtoguchi. Unless the user wants to change the names/location of device libraries or disable linking them, the driver should not have any logic to handle device code linking other than invoking clang-linker-wrapper. It makes sense to have driver options for additional configuration of device libraries, but the driver's implementation should just pass the corresponding values to clang-linker-wrapper, where these options should be processed.

Contributor:

That is very interesting.
While working on #15114 I've been wondering whether there is a particular reason why we link against CUDA libdevice and libclc in the compile step, but then again in the link step.
Could I get some clarification on that?

Contributor:

> That is very interesting. While working on #15114 I've been wondering whether there is a particular reason why we link against CUDA libdevice and libclc in the compile step, but also again in the link step. Could I get some clarification on that?

@Naghasan, @npmiller, are you able to help here?

@sarnex sarnex changed the title [DO NOT REVIEW] Running CI [RFC] thinLTO for SYCL Aug 15, 2024
@sarnex sarnex marked this pull request as ready for review August 15, 2024 16:18
@sarnex sarnex requested review from a team as code owners August 15, 2024 16:18
clang/lib/Driver/ToolChains/Clang.cpp (outdated; resolved)
Comment on lines +1720 to +1724
// We need to set up the backend to use thinLTO
// even if we don't actually use it, and there is no
// backend for the spir64 triple, so override it to
// the SPIR-V backend.
// TODO: Remove once SYCL uses the SPIR-V backend.
Contributor:

As an alternative, we can extend the idea @Naghasan proposed in #7670
(i.e. implement a minimal set of backend interfaces to enable middle-end transformations including thinLTO).

Contributor:

+1 on this. I also briefly tried to use the backend for 'linking' LLVM IR files. I am planning to revisit that PR (#13395) soon.
Also, eventually, we will use SPIRV64 (instead of SPIR64) as the target triple for SYCL purposes.
This change is agreeable as a stopgap, I think.
Adding @Naghasan for comments.

Thanks

@sarnex (Contributor, Author):

Sounds good to me; the fact that we can't get anywhere with the spir triples is a bit frustrating.

@bader (Contributor) commented Aug 21, 2024

> I'd like to get some architecture and code-organization level comments on my SYCL thinLTO prototype.
>
> The feature isn't ready for code-review, so I'd appreciate keeping the discussion somewhat high level.

Let's create a design doc for the thinLTO feature. It will be easier to have this conversation when the design and open questions are kept in a separate document versioned by Git.

> The main idea of the change is as follows: we hook into the existing LTO framework in clang-linker-wrapper by adding a callback that runs right before the backend runs. [...] One problem is that we need to store the module properties, symbol table, and SPIR-V translator output file path until a later time where we are ready to create the final images, which requires all of this information.

I would prefer if we do all "the required processing" in the standard LLVM pipeline:

  1. Generating the module properties.
  2. Generating the module symbol table.
  3. Calling the SPIR-V translator. The SPIR-V translator will be replaced by the SPIR-V backend.
  4. Handling at least spec constants. I'm not 100% sure I fully understand the scope of work we need for "handling", but assuming that it just requires transforming LLVM IR, this should be done by injecting an additional pass into the standard pipeline. Hopefully this doesn't require storing additional information outside of the LLVM IR module.
  5. Internalization of non-entry points. AFAIK, "internalization" is a very common task. Some sorts of internalization are done by the LTO framework itself and by the NVPTX and AMDGPU targets. A couple of examples from the source code:

     // APIFile - A file which contains a list of symbol glob patterns that should
     // not be marked external.
     static cl::opt<std::string>
         APIFile("internalize-public-api-file", cl::value_desc("filename"),
                 cl::desc("A file containing list of symbol names to preserve"));

     // APIList - A list of symbol glob patterns that should not be marked internal.
     static cl::list<std::string>
         APIList("internalize-public-api-list", cl::value_desc("list"),
                 cl::desc("A list of symbol names to preserve"), cl::CommaSeparated);

     // Option to run internalize pass.
     static cl::opt<bool> InternalizeSymbols(
         "amdgpu-internalize-symbols",
         cl::desc("Enable elimination of non-kernel functions and unused globals"),
         cl::init(false), cl::Hidden);

We should consider doing steps 1 and 2 outside of the LTO framework, i.e. either before LTO or after. The SPIR-V module has information about the properties and exported symbols. My understanding is that the first two steps extract this information from the LLVM IR module into a data structure, which the execution environment (in our case the SYCL runtime) uses to "handle" the optimized code at runtime (e.g. JIT compile and/or execute).


@asudarsa (Contributor) commented:

> [quoted @sarnex's original post and @bader's reply in full; see above]
@bader, can you please clarify what you mean by 'standard LLVM pipeline'? Sorry if the question is too naive. I wanted to check before adding my comments here. Thanks

@@ -941,6 +1016,13 @@ static Expected<StringRef> runLLVMToSPIRVTranslation(StringRef File,
return *TempFileOrErr;
}

static Expected<StringRef> runLLVMToSPIRVTranslation(StringRef File,
Contributor:

Just a curiosity question: why was this modification required?
It seems functionally equivalent to the old flow, and the way it's invoked for thinLTO also seems similar.

Thanks

@sarnex (Contributor, Author):

It was because I was hitting multithreading issues: we can't use StringRefs or Args as we want on any thread, so I computed the args before the thinLTO threads are created to work around the issue. It's definitely a hack and will require a real solution at some point.

Contributor:

Considering that DPC++ passes most of end-to-end tests with SPIR-V backend, should we drop the translator support in thinLTO flow? I suppose we won't hit multithreading issues with SPIR-V backend.

@sarnex (Contributor, Author):

That definitely sounds like a good idea to me. @asudarsa, you mentioned we were seeing 95% pass rate on E2E tests with optimization enabled with the SPIR-V backend. Do you agree with dropping support for llvm-spirv in thinLTO and only supporting the backend?

Contributor:

+1 on that. I am planning to call a quick meeting before we do this. I am hoping to use the SPIR-V backend for the new offload model while we continue to use the SPIR-V translator in the old offloading model. This will ensure the switch does NOT impact our customers right away.

auto LTOBackend = Args.hasArg(OPT_clang_backend)
                      ? createLTO(Args, Features, OutputBitcode)
                      : createLTO(Args, Features);
Contributor:

This does not look like a SYCL-specific change (at least on face value). Should this go upstream?

@sarnex (Contributor, Author):

It is not technically SYCL specific, but in this PR SYCL is the only client using the new arguments.

@bader (Contributor) commented Aug 27, 2024

> @bader, can you please clarify..'standard LLVM pipeline'? Sorry if the question is too naive. Wanted to check before adding my comments here. Thanks

I mean the pipeline built here:

/// Build a pre-link, ThinLTO-targeting default optimization pipeline to
/// a pass manager.
///
/// This adds the pre-link optimizations tuned to prepare a module for
/// a ThinLTO run. It works to minimize the IR which needs to be analyzed
/// without making irreversible decisions which could be made better during
/// the LTO run.
ModulePassManager buildThinLTOPreLinkDefaultPipeline(OptimizationLevel Level);

/// Build a ThinLTO default optimization pipeline to a pass manager.
///
/// This provides a good default optimization pipeline for link-time
/// optimization and code generation. It is particularly tuned to fit well
/// when IR coming into the LTO phase was first run through \c
/// buildThinLTOPreLinkDefaultPipeline, and the two coordinate closely.
ModulePassManager
buildThinLTODefaultPipeline(OptimizationLevel Level,
                            const ModuleSummaryIndex *ImportSummary);

@sarnex (Contributor, Author) commented Aug 28, 2024

Thanks all for the feedback, I'll start replying/addressing it shortly.

@sarnex (Contributor, Author) commented Aug 29, 2024

@bader

Thanks for the detailed feedback.

> Let's create a design doc for thinLTO feature. It will be easier to have this conversation when the design and open questions are kept in a separate document versioned by Git.

Okay, will work on it.

> We should consider doing steps 1 and 2 outside of LTO framework i.e. either before LTO or after. SPIR-V module has information about the properties and exported symbols. My understanding is that first two steps extracting this information from LLVM IR module into a data structure, which execution environment (in our case is SYCL runtime) uses to "handle" the optimized code at runtime (e.g. JIT compile and/or execute).

Correct, we need to extract this information into a data structure which will finally be used by the runtime for many things; optimized args is one case. I originally wanted to do it inside the LTO framework so that we wouldn't have to look at the IR at all outside of the framework, for performance reasons (thinLTO uses BitcodeModule, which is lazily loaded), but since we don't have one single module anymore and we shouldn't need functions linked in to compute this information, I think it's reasonable to do it outside the framework. Let me try this out.

> Handle at least spec constants. I'm not 100% sure I fully understand the scope of work we need for "handling", but assuming that it's requires just transform LLVM IR, this should be done by injecting an additional pass to the standard pipeline. Hopefully this doesn't require storing additional information outside of LLVM IR module.

The problem with spec constants is that, based on comments from @AlexeySachkov, we need the entire IR for the module to accurately do the processing; stubs for functions in other TUs won't cut it. The only time that is true with thinLTO is inside the thinLTO framework after function importing. @AlexeySachkov, let me know if I'm wrong about this requirement.

> Internalization of non-entry points. AFAIK, "internalization" is very common task. Some sorts of internalizations are done by LTO framework itself, NVPTX and AMDGPU targets.

Yeah, really good point here; we should just be able to make/modify a pass that is called by LTO. Let me work on this.

If these ideas work, I think we can clean this up a lot. Will report back when I have more info. Thanks again.

@AlexeySachkov (Contributor) commented:

Note that spec constants is only one example, but there are (and will be) likely more passes that require fully linked module to work correctly.

In my opinion, this is a significant disadvantage of the design and should be avoided. There is an extension proposal with alternative API providing similar functionality w/o design limitations of spec constants API.

To clarify: by fully linked I only mean that further linking of two modules that have undergone spec constants processing can't be performed, because spec constants will be broken. It doesn't really require linking all kernels together; we just can't link them together afterwards.

I agree that this is a problem, and it effectively prevents use of a spec constant defined within a shared library by an application that's linked to that library.

@sarnex (Contributor, Author) commented Sep 6, 2024

> To clarify: by fully linked I only mean that further linking of two modules that have undergone spec constants processing can't be performed, because spec constants will be broken. It doesn't really require to link all kernels together. We just can't link them together afterwards.

Do you mean it would be semantically correct to run spec constants processing individually on each module, as long as we only do it once per output file? So if we had 2 cpp files input to clang, each with 2 splits, assuming arbitrary cross-TU dependencies that are not linked in, we'd have 4 output modules; could we run the pass at any point in time on each module, as long as we only do it once? We wouldn't even need to run it inside thinLTO then, we could do it before, although it might be cleaner to do it there anyway.

@AlexeySachkov (Contributor) commented Sep 6, 2024

> Do you mean it would be semantically correct to run spec constants processing individually on each module, as long as we only do it once per output file? So if we had 2 cpp files input to clang each with 2 splits assuming arbitrary cross-TU dependencies, we'll have 4 output modules, so could we run the pass at any point in time on each module, as long as we only do it once?

Right.

> We wouldn't even need to run it inside thinLTO then, we could do it before or after.

I'm not sure about before, because thinLTO may bring a function that uses spec constants into a kernel that also uses spec constants, which will lead to integer ID clashes. After is perfectly fine, though. That's how it is currently implemented, except that we have full LTO instead of thin and no optimizations :)

UPD: if we assume that thinLTO doesn't change a module, i.e. it won't bring anything new to it, then yes, we can run a pass even before thinLTO

@sarnex (Contributor, Author) commented Sep 6, 2024

> I'm not sure about before, because thinLTO may bring a function that uses spec constants to a kernel that also uses spec constants which will lead to integer IDs clash.
>
> UPD: if we assume that thinLTO doesn't change a module, i.e. it won't bring anything new to it, then yes, we can run a pass even before thinLTO

Sorry, I edited my question a ton. It depends on the definition of "change"; thinLTO will definitely change the module by importing function definitions and running optimizations. I think just using the PreCodeGenPassesHook lambda to call it from thinLTO, where everything is linked in, is the cleanest anyway.

@AlexeySachkov (Contributor) commented:

> Sorry, I edited my question a ton. It depends on the definition of change; thinLTO will definitely change the module by importing function definitions and running optimizations.

Optimizations are not a concern; importing of new functions is. We can't link two modules which were processed by the spec constants pass, so it has to run after importing, regardless of optimizations.

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Signed-off-by: Sarnie <nick.sarnie@intel.com>
Signed-off-by: Nick Sarnie <nick.sarnie@intel.com>
@sarnex sarnex requested a review from a team as a code owner September 19, 2024 17:40
@sarnex (Contributor, Author) commented Sep 19, 2024

@bader @asudarsa @AlexeySachkov @RaviNarayanaswamy @mdtoguchi @maksimsab

Hey all, I just pushed a new commit with a design doc and updated prototype implementation based on the feedback. Please take a look when you have a chance. Thanks!

Contributor

@AlexeySachkov AlexeySachkov left a comment

A very brief skim over, I haven't read the new doc yet

@@ -114,9 +114,4 @@ DEVICE_EXTERN_C void __devicelib_assert_fail(const char *expr, const char *file,
__assertfail(expr, file, line, func, 1);
}

DEVICE_EXTERN_C void _wassert(const char *_Message, const char *_File,
Contributor

I feel like this is a change that can be merged and submitted separately. _wassert is a wrapper for MSVC's assert implementation to redirect it to ours, so it really shouldn't be implemented in the fallback library

Contributor Author

I don't even know if it's correct; I just hit a build error on Windows about _wassert being defined twice. It probably works today because they're all weak symbols, but I remove that as part of this PR.

}

bool runOnModule(Module &M) override {
// TODO: determine if we need to run other passes
Contributor

If I understand correctly, that's an equivalent of what's being run by sycl-post-link after device code split is performed. If so, then we have the following other transformations applied at this stage:

  • ESIMD handling, which includes some special module fixup for invoke_simd, as well as potential additional split by ESIMD followed up by optional linking that back
  • Generation of a separate device image with default values of spec constants

If we are also talking about what happens after llvm-link but before device code split, then it is also:

  • Something about invoke_simd
  • Sanitizer-related passes
  • Joint matrix passes

Contributor Author

@sarnex sarnex Sep 19, 2024

So when we do early splitting with -c, we actually run sycl-post-link in full, including all those passes. In that case, we only need to run passes here that need the fully linked module. If we change the design so that -c does only the split but no passes, then we would need every pass that sycl-post-link runs. In the current implementation ~2100/2200 E2E tests are passing, so it seems most passes don't need the full module and running them early does the right thing, at least for the test cases we have.

Contributor

In the current implementation ~2100/2200 E2E tests are passing, so it seems most passes don't need the full module, at least for the test cases we have.

I believe that most of the E2E tests are single-file tests with no SYCL_EXTERNAL dependencies. Even the SYCL-CTS won't help you here. I suppose we need more or less real-life applications here to be sure, and to gather more data if we need it

Contributor

I suggest adding tests similar to sycl/test-e2e/Basic/multisource.cpp.

Contributor

I don't know what is our general stand on this, but GH natively supports rendering of diagrams from markdown using some mermaid library. See the docs here: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-diagrams#creating-mermaid-diagrams
Full docs here: https://mermaid.js.org/intro/ (and you can also use live editor there to test stuff quicker than commit-push-refresh web page)

Tagging @intel/dpcpp-doc-reviewers here for an opinion on whether we are fine to use that functionality. To me, that looks interesting: it should be more or less readable in raw form and it is way easier to edit and update than .svg files added to the repo. So I'm for allowing that functionality in our docs (and actually using it instead of accepting .svg files)

Contributor Author

If mermaid allows creating all of the diagrams we need to, I'm definitely in favor of it, using .svg is a nightmare, at least for me.

Contributor

Contributor Author

Is it okay if I did this in a separate PR? It took me forever to make the current images and don't want to spend possibly forever making new ones :)

Among other changes:
- removed trailing spaces
- fixed 80-char line limitations
@bader
Contributor

bader commented Sep 20, 2024

@sarnex, I've made minor editorial changes, but unfortunately I can't push them to your branch: your repository doesn't allow me to create a pull request to your private branch. If you are okay with the changes from these two commits, please cherry-pick them to your private branch.

Contributor

@bader bader left a comment

Documentation part review.

libraries) from other object files.

We rely on the assumption that all function definitions matching a declaration
will be the same, so we can let ThinLTO pull in any one.
Contributor

The C++ one definition rule guarantees this property of the code, doesn't it?

Contributor Author

I think it depends on what the original IR linkage is. If the original IR is linkonce_odr or something similar, I think yes, but I don't know if we can guarantee every SYCL function will have that linkage (at least for libdevice it is not this way at syclos HEAD)

ThinLTO infrastructure during the processing of `tu_bar` is valid.

As a result of running early device code split, the fat object file generated
as part of device compilation may contain multiple device code images.
Contributor

What would be the linkage type of foo definitions? We need to make sure that device images are linkable i.e. foo definitions will not conflict at link time.

Can this process duplicate SYCL kernel function definitions? If so, can the SYCL runtime handle this duplication?

Contributor Author

After split, the linkage should be the same as it was before split.

After ThinLTO runs, it could have the same linkage as after splitting or it could be internalized (not yet implemented)

I don't think there is any way to get multiple kernel definitions in a way that isn't already possible with splitting. Maybe @AlexeySachkov has an idea.

Contributor

After split, the linkage should be the same as it was before split.

I'm not 100% sure, but this might cause a "multiple definition" problem. tu_foo has only one foo definition, so using external is fine (assuming that all other modules referencing foo use different linkage types), but after split we will have foo defined in multiple modules. I'm not sure if LLVM allows linking modules where foo has the external linkage type.

Contributor Author

I'll have to try some examples and tests to see if we hit a problem like this. In my testing I've never seen a duplicate symbol problem, only an undefined symbol when importing fails for some reason, but of course we may just not have a test for the failing case.

sycl/doc/design/ThinLTO.md (resolved)
Comment on lines +1956 to +1958
auto SpecConstArg = sycl::areSpecConstsSupported(Args, Triple)
? SpecConstantsPass::HandlingMode::native
: SpecConstantsPass::HandlingMode::emulation;
Contributor

I think this logic should go inside the pass itself. It can also be handled by communicating with the target via TTI to identify which mode is supported by the target.

Contributor Author

@AlexeySachkov Any reservations about this change to SpecConstantsPass? Happy to make the change myself.

Comment on lines +1976 to +1978
// Use the legacy PM because eventually we will use the
// PreCodeGenPassesHook field of LTOConfig which requires the legacy PM.
legacy::PassManager PM;
Contributor

Based on the LLVM documentation here and here, the target-independent middle-end optimizer is supposed to use the new PM and only the target-specific CodeGen pipeline should use the legacy PM.
My understanding is that SpecConstArg is a target-independent pass.
If so, shouldn't we inject the pass with PostOptModuleHook instead of PreCodeGenPassesHook?

Contributor Author

@sarnex sarnex Sep 23, 2024

Since we return false from PreCodeGenPassesHook, thus telling the LTO framework to not continue, PostOptModuleHook is not invoked. Once we use the SPIR-V backend, this should definitely be inside `PostOptModuleHook`.

I used the legacy PM here and wrote a legacy pass to make it easier to use PostOptModuleHook once we use the SPIR-V backend.

Contributor

Since we return false from PreCodeGenPassesHook thus telling the LTO framework to not continue, PostOptModuleHook is not invoked.

According to my understanding of the LTOBackend.cpp code, PostOptModuleHook is invoked before PreCodeGenPassesHook.

I think we need to separate codegen to SPIR-V from running "post-link" optimization passes. SYCLLinkedModuleProcessorPass can be added even into PostImportModuleHook, as we want to run standard LLVM optimizations after the SYCL passes handle spec constants, joint matrix, etc.

Contributor Author

Sorry, I typo'd it, the correct response is

Since we return false from PreCodeGenModuleHook thus telling the LTO framework to not continue, PreCodeGenPassesHook is not invoked.

But sure, I'll move to the SPV backend and apply this feedback

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (resolved)
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (resolved)
Comment on lines +622 to +624
for (OffloadFile &OffF : Candidates)
if (llvm::Triple(OffF.getBinary()->getTriple()) == Triple)
DeviceLibs.emplace_back(std::move(OffF));
Contributor

Suggested change
for (OffloadFile &OffF : Candidates)
if (llvm::Triple(OffF.getBinary()->getTriple()) == Triple)
DeviceLibs.emplace_back(std::move(OffF));
llvm::copy_if(Candidates, std::back_inserter(DeviceLibs),
              [&Triple](const OffloadFile &OffF) {
                return llvm::Triple(OffF.getBinary()->getTriple()) == Triple;
              });

Contributor Author

@sarnex sarnex Sep 23, 2024

We need to std::move the elements into the new vector; will copy_if do that?

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (resolved)
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (resolved)
@@ -941,6 +1016,13 @@ static Expected<StringRef> runLLVMToSPIRVTranslation(StringRef File,
return *TempFileOrErr;
}

static Expected<StringRef> runLLVMToSPIRVTranslation(StringRef File,
Contributor

Considering that DPC++ passes most end-to-end tests with the SPIR-V backend, should we drop the translator support in the thinLTO flow? I suppose we won't hit multithreading issues with the SPIR-V backend.

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp Outdated Show resolved Hide resolved
Comment on lines +2620 to +2621
// For thinLTO, we consider device libs as normal compiler input
// and add them to the files to be processed by the LTO backend.
Contributor

This is something we should review with the community. As of now, they import some parts of the GPU devicelib at compile time (i.e. as LLVM IR). Considering community interest in moving this action to the link step, we had better align with them.

Looking at some patches upstream, I feel like some calls to GPU libraries are handled at the link step (e.g. llvm/llvm-project#98512).

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
@bader
Contributor

bader commented Sep 23, 2024

For some reason I can't reply inline

You can reply inline here: #15083 (comment). When you publish a review with new discussions (1) and replies to already existing discussions (2), the replies (2) are posted in the original discussion, where you can answer inline, and duplicated in the review without the inline comment option.

@sarnex
Contributor Author

sarnex commented Sep 23, 2024

Fixed, thx. Github UI confused me.

6 participants