
[RFC] thinLTO for SYCL #15083

Open · wants to merge 10 commits into base: sycl

Conversation

@sarnex (Contributor) commented Aug 14, 2024

Hi all,

I'd like to get some architecture and code-organization level comments on my SYCL thinLTO prototype.

The feature isn't ready for code-review, so I'd appreciate keeping the discussion somewhat high level.

Firstly, if someone wants to try this, you can just clone my branch, build the compiler with --cmake-opt="-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=SPIRV", and then compile SYCL programs with --offload-new-driver -foffload-lto=thin.

The main idea of the change is as follows:

We hook into the existing LTO framework in clang-linker-wrapper by adding a callback that runs right before the backend runs (so right before the SPIR-V backend would be called).

Before the callback is run, function importing happens automatically inside the thinLTO framework where we link in the definitions for any functions referenced in this module but defined in another. When the callback is run, the module passed in should have everything linked in and be ready for processing.

In that callback, we do all the required processing of the SYCL module. In the current prototype, that includes:
  • Generating the module properties
  • Generating the module symbol table
  • Calling the SPIR-V translator

In the future, we will also need to handle at least spec constants and internalization of non-entry points.

The final step of the callback, calling the SPIR-V translator, should be removed when the SPIR-V backend is ready.

One problem is that we need to store the module properties, symbol table, and SPIR-V translator output file path until a later point when we are ready to create the final images, which requires all of this information.

We are also somewhat limited in how we can store the data, because the callbacks run on threads inside the LTO framework, and some data structures (like StringSaver) are not thread-safe. We need to compute this information on the LTO framework thread because that is the only point where we have the IR with functions imported.

As a low-effort hack to get the prototype working, I'm just storing that information as std::string, which is pretty inefficient and gross.

I decided to store the information inside the OffloadBinary. Otherwise, we'd need to create and pass a SYCL-specific data structure all over the place, which seems bad. The current change to OffloadBinary, adding a temporary string vector, is definitely ugly. I'm proposing we make some change to the class that lets us store the information we need, but I don't have a good idea for a clean and thread-safe data structure to use, or for a better place to store all this information.

In terms of test results:

Here are the results when running E2E tests on PVC with --offload-new-driver only enabled:

Total Discovered Tests: 2220
  Unsupported      :  543 (24.46%)
  Passed           : 1553 (69.95%)
  Expectedly Failed:   52 (2.34%)
  Failed           :   72 (3.24%)

And here are the results when running E2E tests on PVC with --offload-new-driver -foffload-lto=thin, so enabling thinLTO:

Total Discovered Tests: 2220
  Unsupported      :  543 (24.46%)
  Passed           : 1466 (66.04%)
  Expectedly Failed:   52 (2.34%)
  Timed Out        :    1 (0.05%)
  Failed           :  158 (7.12%)

It seems we have a decent pass rate, with only 87 thinLTO-specific E2E failures. I'm holding off on investigating these until we finalize the architecture.
Once it's finalized, I'll make a design document to be submitted to this repository.

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Comment on lines +20 to 23
#define DEVICE_EXTERNAL SYCL_EXTERNAL
#else // __SYCL_DEVICE_ONLY__
#define DEVICE_EXTERNAL __attribute__((weak))
#define DEVICE_EXTERNAL
#endif // __SYCL_DEVICE_ONLY__
@sarnex (Contributor, Author) commented Aug 15, 2024

This is required to get libdevice functions linked in by the thinLTO function importing infrastructure, see here. I'm looking for a better solution; I just kept this here in case anybody plans on trying the prototype.

Contributor:

I suppose importing devicelib symbols at the compile step could be a solution (see #15114).

On the other hand, I recall discussing the possibility of linking device libraries with upstream maintainers, who expressed a preference for shifting device library linking from the "compile" to the "link" step. It would be ideal if we could discover a solution that aligns with the long-term strategy of upstream and enables us to utilize the thinLTO framework for offload code linking, thereby avoiding the use of weak symbols.

Contributor:

Attention to @mdtoguchi, who has been looking at importing devicelib at the compile step from the SYCL perspective.
Point to note: during one of the LLVM community presentations, it was mentioned that they are trying to move devicelib importing to link time.

Contributor:

As we already perform device library linking at link time, we can consider abandoning the efforts to pull it into the compilation step. My main concern with performing it at the link step is the communication required from the driver to the clang-linker-wrapper about which device libraries should be linked. The less tie-in we have between the driver and the clang-linker-wrapper at link time, the better. IMO, at the very least the linker wrapper should know a minimal default set of device libraries to link, and any communication from the driver should just manipulate that list.

Contributor:

I agree with @mdtoguchi. Unless the user wants to change the names/location of device libraries or disable linking them, the driver should not have any logic to handle device code linking other than invoking clang-linker-wrapper. It makes sense to have driver options for additional configuration of device libraries, but the driver's implementation should just pass the corresponding values to clang-linker-wrapper, where these options should be processed.

Contributor:

That is very interesting.
While working on #15114 I've been wondering whether there is a particular reason why we link against CUDA libdevice and libclc in the compile step, but then again in the link step.
Could I get some clarification on that?

Contributor:

> That is very interesting. While working on #15114 I've been wondering whether there is a particular reason why we link against CUDA libdevice and libclc in the compile step, but also again in the link step. Could I get some clarification on that?

@Naghasan, @npmiller, are you able to help here?

@sarnex sarnex changed the title [DO NOT REVIEW] Running CI [RFC] thinLTO for SYCL Aug 15, 2024
@sarnex sarnex marked this pull request as ready for review August 15, 2024 16:18
@sarnex sarnex requested review from a team as code owners August 15, 2024 16:18
clang/lib/Driver/ToolChains/Clang.cpp (outdated; resolved)
Comment on lines +1720 to +1724
// We need to set up the backend to use thinLTO
// even if we don't actually use it, and there is no
// backend for the spir64 triple, so override it to
// the SPIR-V backend.
// TODO: Remove once SYCL uses the SPIR-V backend.
Contributor:

As an alternative, we can extend the idea @Naghasan proposed in #7670
(i.e. implement a minimal set of backend interfaces to enable middle-end transformations including thinLTO).

Contributor:

+1 on this. I also briefly tried to use the backend for 'linking' LLVM IR files. I am planning to revisit that PR (#13395) soon.
Also, eventually, we will use SPIRV64 (instead of SPIR64) as the target triple for SYCL purposes.
This change is agreeable as a stopgap, I think.
Adding @Naghasan for comments.

Thanks

@sarnex (Contributor, Author):

Sounds good to me; the fact that we can't get anywhere with the spir triples is a bit frustrating.

@bader (Contributor) commented Aug 21, 2024

> I'd like to get some architecture and code-organization level comments on my SYCL thinLTO prototype.
>
> The feature isn't ready for code-review, so I'd appreciate keeping the discussion somewhat high level.

Let's create a design doc for the thinLTO feature. It will be easier to have this conversation when the design and open questions are kept in a separate document versioned by Git.

> The main idea of the change is as follows: we hook into the existing LTO framework in clang-linker-wrapper by adding a callback that runs right before the backend runs. [...] One problem is that we need to store the module properties, symbol table, and SPIR-V translator output file path until a later time where we are ready to create the final images, which requires all of this information.

I would prefer if we do all "the required processing" in the standard LLVM pipeline:

  1. Generating the module properties.
  2. Generating the module symbol table.
  3. Calling the SPIR-V translator. The SPIR-V translator will be replaced by the SPIR-V backend.
  4. Handling at least spec constants. I'm not 100% sure I fully understand the scope of work we need for "handling", but assuming that it just requires transforming LLVM IR, this should be done by injecting an additional pass into the standard pipeline. Hopefully this doesn't require storing additional information outside of the LLVM IR module.
  5. Internalization of non-entry points. AFAIK, "internalization" is a very common task. Some sorts of internalization are done by the LTO framework itself and by the NVPTX and AMDGPU targets. A couple of examples from the source code:

     // APIFile - A file which contains a list of symbol glob patterns that should
     // not be marked external.
     static cl::opt<std::string>
         APIFile("internalize-public-api-file", cl::value_desc("filename"),
                 cl::desc("A file containing list of symbol names to preserve"));

     // APIList - A list of symbol glob patterns that should not be marked internal.
     static cl::list<std::string>
         APIList("internalize-public-api-list", cl::value_desc("list"),
                 cl::desc("A list of symbol names to preserve"), cl::CommaSeparated);

     // Option to run internalize pass.
     static cl::opt<bool> InternalizeSymbols(
         "amdgpu-internalize-symbols",
         cl::desc("Enable elimination of non-kernel functions and unused globals"),
         cl::init(false), cl::Hidden);

We should consider doing steps 1 and 2 outside of the LTO framework, i.e. either before LTO or after. The SPIR-V module has information about the properties and exported symbols. My understanding is that the first two steps extract this information from the LLVM IR module into a data structure, which the execution environment (in our case the SYCL runtime) uses to "handle" the optimized code at runtime (e.g. JIT compile and/or execute).


@asudarsa (Contributor) commented:

> [quoted @sarnex's original post and @bader's reply in full; see above]
@bader, can you please clarify what you mean by 'standard LLVM pipeline'? Sorry if the question is too naive. I wanted to check before adding my comments here. Thanks

@@ -941,6 +1016,13 @@ static Expected<StringRef> runLLVMToSPIRVTranslation(StringRef File,
return *TempFileOrErr;
}

static Expected<StringRef> runLLVMToSPIRVTranslation(StringRef File,
Contributor:

Just a curiosity question: why was this modification required?
It seems functionally equivalent to the old flow, and the way it's invoked for thinLTO also seems similar.

Thanks

@sarnex (Contributor, Author):

It was because I was hitting multithreading issues: we can't use StringRefs or Args as we want on any thread, so I computed the args before the thinLTO threads are created to work around the issue. It's definitely a hack and will require a real solution at some point.

Contributor:

Considering that DPC++ passes most of end-to-end tests with SPIR-V backend, should we drop the translator support in thinLTO flow? I suppose we won't hit multithreading issues with SPIR-V backend.

@sarnex (Contributor, Author):

That definitely sounds like a good idea to me. @asudarsa, you mentioned we were seeing 95% pass rate on E2E tests with optimization enabled with the SPIR-V backend. Do you agree with dropping support for llvm-spirv in thinLTO and only supporting the backend?

Contributor:

+1 on that. I am planning to call a quick meeting before we do this. I am hoping to use the SPIR-V backend for the new offload model while we continue to use the SPIR-V translator in the old offloading model. This will ensure the switch does NOT impact our customers right away.

auto LTOBackend = Args.hasArg(OPT_clang_backend)
                      ? createLTO(Args, Features, OutputBitcode)
                      : createLTO(Args, Features);
Contributor:

This does not look like a SYCL-specific change (at least on face value). Should this go upstream?

@sarnex (Contributor, Author):

It is not technically SYCL specific, but in this PR SYCL is the only client using the new arguments.

@bader (Contributor) commented Aug 27, 2024

> @bader, can you please clarify..'standard LLVM pipeline'? Sorry if the question is too naive. Wanted to check before adding my comments here. Thanks

I mean the pipeline built here:

/// Build a pre-link, ThinLTO-targeting default optimization pipeline to
/// a pass manager.
///
/// This adds the pre-link optimizations tuned to prepare a module for
/// a ThinLTO run. It works to minimize the IR which needs to be analyzed
/// without making irreversible decisions which could be made better during
/// the LTO run.
ModulePassManager buildThinLTOPreLinkDefaultPipeline(OptimizationLevel Level);

/// Build a ThinLTO default optimization pipeline to a pass manager.
///
/// This provides a good default optimization pipeline for link-time
/// optimization and code generation. It is particularly tuned to fit well
/// when IR coming into the LTO phase was first run through \c
/// buildThinLTOPreLinkDefaultPipeline, and the two coordinate closely.
ModulePassManager
buildThinLTODefaultPipeline(OptimizationLevel Level,
                            const ModuleSummaryIndex *ImportSummary);

@sarnex (Contributor, Author) commented Aug 28, 2024

Thanks all for the feedback, I'll start replying/addressing it shortly.

@sarnex (Contributor, Author) commented Aug 29, 2024

@bader

Thanks for the detailed feedback.

> Let's create a design doc for thinLTO feature. It will be easier to have this conversation when the design and open questions are kept in a separate document versioned by Git.

Okay, will work on it.

> We should consider doing steps 1 and 2 outside of LTO framework i.e. either before LTO or after. SPIR-V module has information about the properties and exported symbols. My understanding is that first two steps extracting this information from LLVM IR module into a data structure, which execution environment (in our case is SYCL runtime) uses to "handle" the optimized code at runtime (e.g. JIT compile and/or execute).

Correct, we need to extract this information into a data structure which will finally be used by the runtime for many things; optimized args is one case. I originally wanted to do it inside the LTO framework so that we wouldn't have to look at the IR at all outside of the framework, for performance reasons (thinLTO uses BitcodeModule, which is lazily loaded), but since we don't have one single module anymore and we shouldn't need functions linked in to compute this information, I think it's reasonable to do it outside the framework. Let me try this out.

> Handle at least spec constants. I'm not 100% sure I fully understand the scope of work we need for "handling", but assuming that it's requires just transform LLVM IR, this should be done by injecting an additional pass to the standard pipeline. Hopefully this doesn't require storing additional information outside of LLVM IR module.

The problem with spec constants is that, based on comments from @AlexeySachkov, we need the entire IR for the module to accurately do the processing; stubs for functions in other TUs won't cut it. The only time that is true with thinLTO is inside the thinLTO framework after function importing. @AlexeySachkov, let me know if I'm wrong about this requirement.

> Internalization of non-entry points. AFAIK, "internalization" is very common task. Some sorts of internalizations are done by LTO framework itself, NVPTX and AMDGPU targets.

Yeah, really good point here; we should just be able to make/modify a pass that is called by LTO. Let me work on this.

If these ideas work, I think we can clean this up a lot. Will report back when I have more info. Thanks again.

@AlexeySachkov (Contributor) commented:

Note that spec constants is only one example, but there are (and will be) likely more passes that require fully linked module to work correctly.

In my opinion, this is a significant disadvantage of the design and should be avoided. There is an extension proposal with alternative API providing similar functionality w/o design limitations of spec constants API.

To clarify: by fully linked I only mean that further linking of two modules that have undergone spec constants processing can't be performed, because spec constants will be broken. It doesn't really require linking all kernels together; we just can't link them together afterwards.

I agree that this is a problem, and it effectively prevents use of a spec constant defined within a shared library by an application that's linked to that library.

@sarnex (Contributor, Author) commented Sep 6, 2024

> To clarify: by fully linked I only mean that further linking of two modules that have undergone spec constants processing can't be performed, because spec constants will be broken. It doesn't really require to link all kernels together. We just can't link them together afterwards.

Do you mean it would be semantically correct to run spec constants processing individually on each module, as long as we only do it once per output file? So if we had 2 cpp files input to clang, each with 2 splits, assuming arbitrary cross-TU dependencies that are not linked in, we'd have 4 output modules; could we run the pass at any point in time on each module, as long as we only do it once? We wouldn't even need to run it inside thinLTO then, we could do it before, although it might be cleaner to do it there anyway.

@AlexeySachkov (Contributor) commented Sep 6, 2024

> Do you mean it would be semantically correct to run spec constants processing individually on each module, as long as we only do it once per output file? So if we had 2 cpp files input to clang each with 2 splits assuming arbitrary cross-TU dependencies, we'll have 4 output modules, so could we run the pass at any point in time on each module, as long as we only do it once?

Right.

> We wouldn't even need to run it inside thinLTO then, we could do it before or after.

I'm not sure about before, because thinLTO may bring a function that uses spec constants into a kernel that also uses spec constants, which will lead to integer ID clashes. After is perfectly fine, though. That's how it is currently implemented, except that we have full LTO instead of thin and no optimizations :)

UPD: if we assume that thinLTO doesn't change a module, i.e. it won't bring anything new to it, then yes, we can run a pass even before thinLTO

@sarnex (Contributor, Author) commented Sep 6, 2024

> I'm not sure about before, because thinLTO may bring a function that uses spec constants to a kernel that also uses spec constants which will lead to integer IDs clash.
>
> UPD: if we assume that thinLTO doesn't change a module, i.e. it won't bring anything new to it, then yes, we can run a pass even before thinLTO

Sorry, I edited my question a ton. It depends on the definition of "change"; thinLTO will definitely change the module by importing function definitions and running optimizations. I think just using the PreCodeGenPassesHook lambda to call it from thinLTO, where everything is linked in, is the cleanest anyway.

@AlexeySachkov (Contributor) commented:

> Sorry, I edited my question a ton. It depends on the definition of change; thinLTO will definitely change the module by importing function definitions and running optimizations.

Optimizations are not a concern; importing of new functions is. We can't link two modules which were processed by the spec constants pass, so it has to run after importing, regardless of optimizations.

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Signed-off-by: Sarnie <nick.sarnie@intel.com>
Signed-off-by: Nick Sarnie <nick.sarnie@intel.com>
@sarnex sarnex requested a review from a team as a code owner September 19, 2024 17:40
@sarnex (Contributor, Author) commented Sep 19, 2024

@bader @asudarsa @AlexeySachkov @RaviNarayanaswamy @mdtoguchi @maksimsab

Hey all, I just pushed a new commit with a design doc and updated prototype implementation based on the feedback. Please take a look when you have a chance. Thanks!

Contributor

@AlexeySachkov AlexeySachkov left a comment

A very brief skim over, I haven't read the new doc yet

@@ -114,9 +114,4 @@ DEVICE_EXTERN_C void __devicelib_assert_fail(const char *expr, const char *file,
__assertfail(expr, file, line, func, 1);
}

DEVICE_EXTERN_C void _wassert(const char *_Message, const char *_File,
Contributor

I feel like this is a change that can be merged and submitted separately. _wassert is a wrapper for MSVC's assert implementation to redirect it to ours, so it really shouldn't be implemented in the fallback library

Contributor Author

I don't even know if it's correct; I just hit a build error on Windows about _wassert being defined twice. It probably works today because they're all weak symbols, but I remove that as part of this PR.

}

bool runOnModule(Module &M) override {
// TODO: determine if we need to run other passes
Contributor

If I understand correctly, that's an equivalent of what's being run by sycl-post-link after device code split is performed. If so, then we have the following other transformations applied at this stage:

  • ESIMD handling, which includes some special module fixup for invoke_simd, as well as potential additional split by ESIMD followed up by optional linking that back
  • Generation of a separate device image with default values of spec constants

If we are also talking about what happens after llvm-link but before device code split, then it is also:

  • Something about invoke_simd
  • Sanitizer-related passes
  • Joint matrix passes

Contributor Author

@sarnex sarnex Sep 19, 2024

So when we do early splitting with -c, we actually run sycl-post-link in full, including all those passes. In that case, we only need to run passes here that need the fully linked module. If we change the design so that -c does only the split but no passes, then we would need every pass that sycl-post-link runs. In the current implementation ~2100/2200 E2E tests are passing, so it seems most passes don't need the full module and running them early does the right thing, at least for the test cases we have.

Contributor

In the current implementation ~2100/2200 E2E tests are passing, so it seems most passes don't need the full module, at least for the test cases we have.

I believe that most of the E2E tests are single-file tests with no SYCL_EXTERNAL dependencies. Even the SYCL-CTS won't help you here. I suppose we need more or less real-life applications here to be sure, and to gather more data if we need it

Contributor

I suggest adding tests similar to sycl/test-e2e/Basic/multisource.cpp.

Contributor

I don't know what is our general stand on this, but GH natively supports rendering of diagrams from markdown using some mermaid library. See the docs here: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-diagrams#creating-mermaid-diagrams
Full docs here: https://mermaid.js.org/intro/ (and you can also use live editor there to test stuff quicker than commit-push-refresh web page)

Tagging @intel/dpcpp-doc-reviewers here for an opinion on whether we are fine to use that functionality. To me, that looks interesting: it should be more or less readable in raw form and it is way easier to edit and update than .svg files added to the repo. So I'm for allowing that functionality in our docs (and actually using it instead of accepting .svg files)

Contributor Author

If mermaid allows creating all of the diagrams we need to, I'm definitely in favor of it, using .svg is a nightmare, at least for me.

Contributor

Contributor Author

Is it okay if I did this in a separate PR? It took me forever to make the current images and don't want to spend possibly forever making new ones :)

Among other changes:
- removed trailing spaces
- fixed 80-char line limitations
@bader
Contributor

bader commented Sep 20, 2024

@sarnex, I've made minor editorial changes, but unfortunately I can't push them to your branch: your repository doesn't allow me to create a pull request to your private branch. If you are okay with the changes from these two commits, please cherry-pick them to your private branch.

Contributor

@bader bader left a comment

Documentation part review.

libraries) from other object files.

We rely on the assumption that all function definitions matching a declaration
will be the same, so we can let ThinLTO pull in any one.
Contributor

The C++ one definition rule guarantees this property of the code, doesn't it?

Contributor Author

I think it depends on what the original IR linkage is. If the original IR is linkonce_odr or something similar, I think yes, but I don't know if we can guarantee every SYCL function will have that linkage (at least for libdevice it is not this way at syclos HEAD)

ThinLTO infrastructure during the processing of `tu_bar` is valid.

As a result of running early device code split, the fat object file generated
as part of device compilation may contain multiple device code images.
Contributor

What would be the linkage type of foo definitions? We need to make sure that device images are linkable i.e. foo definitions will not conflict at link time.

Can this process duplicate SYCL kernel function definitions? If so, can the SYCL runtime handle this duplication?

Contributor Author

After split, the linkage should be the same as it was before split.

After ThinLTO runs, it could have the same linkage as after splitting or it could be internalized (not yet implemented)

I don't think there is any way to get multiple kernel definitions in a way that isn't already possible with splitting. Maybe @AlexeySachkov has an idea.

Contributor

After split, the linkage should be the same as it was before split.

I'm not 100% sure, but this might cause a "multiple definition" problem. tu_foo has only one foo definition, so using external is fine (assuming that all other modules referencing foo use different linkage types), but after split we will have foo defined in multiple modules. I'm not sure if LLVM allows linking modules where foo has the external linkage type.

Contributor Author

I'll have to try some examples and tests to see if we hit a problem like this. In my testing I've never seen a duplicate symbol problem, only an undefined symbol when importing fails for some reason, but of course we may just not have a test for the failing case.

sycl/doc/design/ThinLTO.md (resolved)
Comment on lines +1956 to +1958
auto SpecConstArg = sycl::areSpecConstsSupported(Args, Triple)
? SpecConstantsPass::HandlingMode::native
: SpecConstantsPass::HandlingMode::emulation;
Contributor

I think this logic should go inside the pass itself. It can also be handled by communicating with the target via TTI to identify which mode is supported by the target.

Contributor Author

@AlexeySachkov Any reservations about this change to SpecConstantsPass? Happy to make the change myself.

Comment on lines +1976 to +1978
// Use the legacy PM because eventually we will use the
// PreCodeGenPassesHook field of LTOConfig which requires the legacy PM.
legacy::PassManager PM;
Contributor

Based on the LLVM documentation here and here, the target-independent middle-end optimizer is supposed to use the new PM and only the target-specific CodeGen pipeline should use the legacy PM.
My understanding is that SpecConstArg is a target-independent pass.
If so, shouldn't we inject the pass with PostOptModuleHook instead of PreCodeGenPassesHook?

Contributor Author

@sarnex sarnex Sep 23, 2024

Since we return false from PreCodeGenPassesHook, thus telling the LTO framework to not continue, PostOptModuleHook is not invoked. Once we use the SPIR-V backend, this should definitely be inside `PostOptModuleHook`.

I used the legacy PM here and wrote a legacy pass to make it easier to use PostOptModuleHook once we use the SPIR-V backend.

Contributor

Since we return false from PreCodeGenPassesHook thus telling the LTO framework to not continue, PostOptModuleHook is not invoked.

According to my understanding of the LTOBackend.cpp code, PostOptModuleHook is invoked before PreCodeGenPassesHook.

I think we need to separate codegen to SPIR-V from running "post-link" optimization passes. SYCLLinkedModuleProcessorPass can be added even into PostImportModuleHook, as we want to run standard LLVM optimizations after the SYCL passes handle spec constants, joint matrix, etc.

Contributor Author

Sorry, I typo'd it, the correct response is

Since we return false from PreCodeGenModuleHook thus telling the LTO framework to not continue, PreCodeGenPassesHook is not invoked.

But sure, I'll move to the SPV backend and apply this feedback

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (resolved)
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (resolved)
Comment on lines +622 to +624
for (OffloadFile &OffF : Candidates)
if (llvm::Triple(OffF.getBinary()->getTriple()) == Triple)
DeviceLibs.emplace_back(std::move(OffF));
Contributor

Suggested change
for (OffloadFile &OffF : Candidates)
if (llvm::Triple(OffF.getBinary()->getTriple()) == Triple)
DeviceLibs.emplace_back(std::move(OffF));
llvm::copy_if(Candidates, std::back_inserter(DeviceLibs),
              [&Triple](const OffloadFile &OffF) {
                return llvm::Triple(OffF.getBinary()->getTriple()) == Triple;
              });

Contributor Author

@sarnex sarnex Sep 23, 2024

We need to std::move the elements into the new vector; will copy_if do that?

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (resolved)
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (resolved)
@@ -941,6 +1016,13 @@ static Expected<StringRef> runLLVMToSPIRVTranslation(StringRef File,
return *TempFileOrErr;
}

static Expected<StringRef> runLLVMToSPIRVTranslation(StringRef File,
Contributor

Considering that DPC++ passes most end-to-end tests with the SPIR-V backend, should we drop the translator support in the thinLTO flow? I suppose we won't hit multithreading issues with the SPIR-V backend.

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp Outdated Show resolved Hide resolved
Comment on lines +2620 to +2621
// For thinLTO, we consider device libs as normal compiler input
// and add them to the files to be processed by the LTO backend.
Contributor

This is something we should review with the community. As of now, they import some parts of the GPU devicelib at compile time (i.e. as LLVM IR). Considering community interest in moving this action to the link step, we had better align with them.

Looking at some patches upstream, I feel like some calls to GPU libraries are handled at the link step (e.g. llvm/llvm-project#98512).

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
@bader
Contributor

bader commented Sep 23, 2024

For some reason I can't reply inline

You can reply inline here: #15083 (comment). When you publish a review with new discussions (1) and replies to already existing discussions (2), the replies (2) are posted in the original discussion, where you can answer inline, and duplicated in the review without the inline comment option.

@sarnex
Contributor Author

sarnex commented Sep 23, 2024

Fixed, thx. Github UI confused me.

6 participants