[Queries] Regarding usage of LLVM built with Pretrained Models and Development Mode #350

quic-garvgupt · 2024-08-13T14:24:15Z

quic-garvgupt
Aug 13, 2024

Hi,

I have successfully built a toolchain using the model inlining-Oz-v1.1 released [here(https://github.com/google/ml-compiler-opt/releases). However, I have some queries regarding its usage while building an application in release mode, as well as some questions pertaining to development mode.

Release Mode

Should the -mllvm -enable-ml-inliner=release flag be added only to the clang driver, or to both the clang driver and the linker? The application I am building invokes compiler and linker through separate command line invocations.
While building llvm for release mode, is it necessary to disable LTO?
After building llvm for release mode, are there any restrictions on the usage of certain flags which cannot be used like -flto etc. while building the application. Can LTO (both thin and full) be enabled when building the application?

Development Mode

While building llvm to generate the corpus for training mode, is it necessary to disable LTO?
When building the application to generate the corpus, can LTO (both thin and full) be enabled, or should LTO be disabled while building the application?
The paper published two different strategies for training (PG and ES). Is there a way for a user to specify which training method to use through command line flags or other means while in development mode?

Answered by mtrofin

Sep 24, 2024

I observed at best a 0.5% improvement [...]

... which may be actually all that can be squeezed - for small projects that are already hyper-optimized for size, there's only so much headroom left.

Since I am dealing with multiple small codebases [...]

Try combining the corpora instead - i.e. from N small corpora (which you already extracted), you consolidate them all into one. Then do quantization ("vocab"), then training, on that combined one.

Another possibility is to use the ComPile database of IR modules. @boomanaiden154, does that come with a way to get a corpus.json? There are more nuances to discuss here - in fact, it'd be an interesting to explore methodology here: as a hypothes…

View full answer

boomanaiden154 · 2024-08-13T14:55:18Z

boomanaiden154
Aug 13, 2024
Collaborator

Should the -mllvm -enable-ml-inliner=release flag be added only to the clang driver, or to both the clang driver and the linker? The application I am building invokes compiler and linker through separate command line invocations.

It depends upon if you're doing (Thin)LTO. If you're not using (Thin)LTO, then you should be fine omitting it from the linker. If you are using some form of LTO, then you need to pass it to the linker too so that it will use the policy for inlining there.

While building llvm for release mode, is it necessary to disable LTO?

No. The build options for LLVM should not matter.

After building llvm for release mode, are there any restrictions on the usage of certain flags which cannot be used like -flto etc. while building the application. Can LTO (both thin and full) be enabled when building the application?

There shouldn't be anything major. You can use (Thin)LTO to build the application. You just need to make sure to pass the flag to the linker too so that it will use the correct inlining policy. The policy might also change in effectiveness when going to LTO, depending upon the corpus that it was trained on.

While building llvm to generate the corpus for training mode, is it necessary to disable LTO?

No. You should be able to use pretty much whatever build options you like for LLVM.

When building the application to generate the corpus, can LTO (both thin and full) be enabled, or should LTO be disabled while building the application?

Ideally it should be representative of how you build your application in production. If you don't use (Thin)LTO there, then training on a (Thin)LTO corpus does not make sense. If you do, then training on a non-(Thin)LTO corpus does not make a lot of sense.

The paper published two different strategies for training (PG and ES). Is there a way for a user to specify which training method to use through command line flags or other means while in development mode?

It's not a LLVM flag. It would be flags/different scripts within this repository that drive the training pipeline. The demos are currently written to use PPO. There is no end-to-end script that uses ES for training currently, although getting one written isn't too big of a deal now that most of the ES stuff is upstreamed.

0 replies

mtrofin · 2024-08-13T15:06:10Z

mtrofin
Aug 13, 2024
Maintainer

Should the -mllvm -enable-ml-inliner=release flag be added only to the clang driver, or to both the clang driver and the linker? The application I am building invokes compiler and linker through separate command line invocations.

I assume your build performs some kind of LTO. There's no hard and fast rule, I'd experiment with/without enabling in the backend optimization. FWIW, the model we have here was trained assuming no LTO step.

While building llvm for release mode, is it necessary to disable LTO?

(IIUC this is about building e.g. clang itself) Shouldn't be necessary to disable anything, all that building llvm with release model does is (insofar the linker is concerned) add a precompiled library and a .o. We build clang with ThinLTO, and embed (release mode) both inliner and regalloc models. However, we don't have a full LTO scenario. Is something broken when doing a full LTO of clang and trying to embed a model?

After building llvm for release mode, are there any restrictions on the usage of certain flags which cannot be used like -flto etc. while building the application. Can LTO (both thin and full) be enabled when building the application?

No restrictions, but "mileage may vary": for example, if you trained on a corpus of post-thinLTO IR modules, you'll get best results when applying that model to similar modules. One culprit to this is that features get quantized (bucketized), and if the distribution of feature values is too far off, benefits would degrade.

While building llvm to generate the corpus for training mode, is it necessary to disable LTO?

You don't need LLVM built in any different way, btw, to collect a corpus. The functionality for corpus collection is in any build of clang. The main thing is to use the same compiler version (i.e. from the same llvm repo githash) when collecting the corpus as when later compiling it, just to avoid things like IR breaking changes. So you could use the clang you build for training, for example.

When building the application to generate the corpus, can LTO (both thin and full) be enabled, or should LTO be disabled while building the application?

You can collect the IR corpus from either before the pre-thinlink compilation or from post-thinlink. We never do anything with LTO, only ThinLTO, so we never added that support to full LTO. To answer your question, it's less about how you build that application and more about which IR you want to train on.

If your scenario involves ThinLTO, I'd recommend starting by training on the post-link IR first - i.e. have the normal inliner in the frontend, and ML in the backend. Then it gets tricky and you need to experiment - you could stop there (i.e. if you get reasonable savings, just use ML in the post-thinlink); or try the model in both front and back; or you could build a second corpus from the frontend IR and continue training there; or (probably best) collect the 2 corpora first, do quantization on them, then train on one and then finetune on the other. We did the "train mostly on the back, finetune in front" without quantization for Chrome on Android 32 bit (@Northbadge did that and he can correct me if I misremember), and only "back" for 64 bit, for example (that bit is fresher in memory, @alekh @tvmarino's work).

The paper published two different strategies for training (PG and ES). Is there a way for a user to specify which training method to use through command line flags or other means while in development mode?

Not yet, but I have a cludge that demonstrates using ES in my fork: https://github.com/mtrofin/ml-compiler-opt/tree/es. Focus on "cludge". @boomanaiden154 has, I think, a plan to bring ES into the fold cleanly.

0 replies

mtrofin · 2024-08-13T15:07:04Z

mtrofin
Aug 13, 2024
Maintainer

Oh, just saw @boomanaiden154 also replied. Sorry for some duplicate info!

0 replies

quic-garvgupt · 2024-08-14T12:44:43Z

quic-garvgupt
Aug 14, 2024
Author

Thank you for the detailed response, @mtrofin and @boomanaiden154.

However, we don't have a full LTO scenario. Is something broken when doing a full LTO of clang and trying to embed a model?

I was going through demo and noticed LTO being disabled in developnment mode hence the question. I am building clang without any LTO as well however wanted to clarify is this was a necessity or can we be build clang with any options.

You can collect the IR corpus from either before the pre-thinlink compilation or from post-thinlink. We never do anything with LTO, only ThinLTO, so we never added that support to full LTO.

I missed the point that corpus collection is only supported for thin LTO at the moment and not full LTO.

Not yet, but I have a cludge that demonstrates using ES in my fork: https://github.com/mtrofin/ml-compiler-opt/tree/es. Focus on "cludge". @boomanaiden154 has, I think, a plan to bring ES into the fold cleanly.

Thanks for sharing this. I'll definitely try this out. Just to be clear, this follows the same instructions as mentioned the demo and it will use ES strategy to train the model?

I will keep this issue open for some time while I work on this project in case I have any further queries or comments. Thank you once again!

0 replies

mtrofin · 2024-08-14T14:19:37Z

mtrofin
Aug 14, 2024
Maintainer

I missed the point that corpus collection is only supported for thin LTO at the moment and not full LTO.

...and no-lto (i.e. just frontend - like, IIUC, your scenario)

Just to be clear, this follows the same instructions as mentioned the demo and it will use ES strategy to train the model?

In broad strokes, yes, i.e. if you treat the training script as a black box, then everything else should be the same; but I'd recommend checking (like debugging or print-ing in the python code) to make sure you're doing ES - like I said, that branch is a cludge (note the absence of tests, for instance :) )

0 replies

quic-garvgupt · 2024-08-19T12:45:48Z

quic-garvgupt
Aug 19, 2024
Author

...and no-lto (i.e. just frontend - like, IIUC, your scenario)

To clarify, is there currently support for corpus collection at both thin LTO and no LTO? Apologies for asking this again, but I interpreted your response as "corpus collection is only supported for thin LTO and no LTO".

As you mentioned, I am building the application with no LTO, and after I do extract_ir, my corpus description contains no modules. So wanted to know if I am messing at some place or if corpus extraction for application built with no LTO is not supported. If this is the case, could you provide some suggestion on how to enable corpus extraction for application built with no LTO?

0 replies

mtrofin · 2024-08-19T14:58:09Z

mtrofin
Aug 19, 2024
Maintainer

Yup, see here: https://github.com/llvm/llvm-project/blob/main/llvm/utils/mlgo-utils/mlgo/corpus/extract_ir.py#L12

no LTO: compile with -fembed-bitcode=all (potentially passing it after -Xclang, depends if your compile steps invoke the driver or directly compiler)
distributed thinLTO: pass to the backend clang invocation, -mllvm -lto-embed-bitcode=post-merge-pre-opt
local thinlto: pass the linker -Wl,--save-temps=import -Wl,--thinlto-emit-index-files

There are some more nuances with local thinlto, if you chase the --thinlto_build flag in the script, those should become clear.

0 replies

quic-garvgupt · 2024-08-21T16:57:35Z

quic-garvgupt
Aug 21, 2024
Author

Thanks for your response!

I successfully extracted IR, generated a corpus, and trained a warmstart model. Currently, the training of the RL model is still in progress. I want to get a rough idea of the training time because the data I’m using is fairly small—only 88 modules, as mentioned in the info after trace collection.

I0821 09:36:28.822783 140070467577600 generate_default_trace.py:202] 88 success, 0 failed out of 88 88 of 88 modules succeeded, and 39 trainining examples written

It took about 45 minutes to train the warmstart model, and it has been more than 8 hours since the RL model training began. Is there any rough estimate of how long the training might take for the above number of modules on a 32-core machine with 64 GB of RAM? I am using the default set of parameters for the model as mentioned in /local/mnt/workspace/garvgupt/ml-compiler-opt/compiler_opt/rl/inlining/gin_configs/ppo_nn_agent.gin file.

Additionally, since I am still a novice in model engineering, any advice on what values to set or how to decide the values for the parameters mentioned in the above gin file for the small training dataset would be appreciated. TIA

0 replies

mtrofin · 2024-08-21T18:18:15Z

mtrofin
Aug 21, 2024
Maintainer

Is there any rough estimate of how long the training might take for the above number of modules on a 32-core machine with 64 GB of RAM?

If you look at the tensorboard progression of the reward, especially since you are (IIUC) processing at each pass the entire corpus, that (tensorboard) should give you an indication (e.g. if it's not making much progress in improving the reward anymore, it probably learned enough).

You could also try the current saved model (it's under the output directory - make sure you don't pick the one called collect_ something)

Additionally, since I am still a novice in model engineering, any advice on what values to set or how to decide the values for the parameters mentioned in the above gin file for the small training dataset would be appreciated. TIA

IIRC we did a hyperparameter sweep using xmanager. The infra should be easily adaptable to that - and we did, internally, but haven't yet pushed upstream. But all that says is "trial and error", really.

0 replies

quic-garvgupt · 2024-08-27T12:06:07Z

quic-garvgupt
Aug 27, 2024
Author

In the graph above, rewards and the mean of rewards are plotted. Towards the end, it flattens out. Does this mean there is not much left to learn? I am also unsure what a reward of 0 indicates.

0 replies

mtrofin · 2024-08-27T14:44:09Z

mtrofin
Aug 27, 2024
Maintainer

You want to look at reward_distribution. It should look like it's asymptotically reaching some positive value.

(this is mentioned in passing in the inlining demo, if you search for "tensorboard")

0 replies

quic-garvgupt · 2024-09-23T12:07:03Z

quic-garvgupt
Sep 23, 2024
Author

Hi

The code size and perf number mentioned in the below spreadsheets are with which of the pre-trained models from the release page?

Are there any plans to release more pre-trained models for other architecture including Arm and RISCV?

Have there been any attempts to train a model on zephyr for riscv by someone from MLGO community or from some other individual contributor owners might be aware of?

Can the demo be extended to build Fuchsia and train a model for it on RISC-V as well? Currently, it only supports x64, and the same instructions can be used for ARM64 with minimal changes. However, for RISC-V, it seems that more modifications are needed. Given that code size is crucial for embedded systems where RISC-V is predominantly used, being able to train a model on Fuchsia for RISC-V would be a significant advantage. Please let me know if you need anything from my side for this. Additionally, I can create a separate issue to gather more input from the community if needed.

2 replies

mtrofin Sep 23, 2024
Maintainer

Hi

The code size and perf number mentioned in the below spreadsheets are with which of the pre-trained models from the release page?

https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=987260531&single=true

https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=0&single=true

Those look like the links from the original RFC. I think those are off the "v0.1" inline-oz model.

Are there any plans to release more pre-trained models for other architecture including Arm and RISCV?

No. The "released" models are for reference, and we pushed them whenever we made some significant updates to the feature set. They aren't meant to replace a model trained on a more specific corpus.

Have there been any attempts to train a model on zephyr for riscv by someone from MLGO community or from some other individual contributor owners might be aware of?

Can the demo be extended to build Fuchsia and train a model for it on RISC-V as well? Currently, it only supports x64, and the same instructions can be used for ARM64 with minimal changes. However, for RISC-V, it seems that more modifications are needed. Given that code size is crucial for embedded systems where RISC-V is predominantly used, being able to train a model on Fuchsia for RISC-V would be a significant advantage. Please let me know if you need anything from my side for this. Additionally, I can create a separate issue to gather more input from the community if needed.

@petrhosek about Fuchsia and RISC-V.

Note that the demo is just meant to showcase the build steps. If your goal is to train a model for a specific RISC-V - targeting codebase, why not train on that directly? See earlier, model effectiveness depends on the feature values having similar distributions during inference as they do during training, so if your code is vastly different from Fuchsia's, this is something to keep in mind.

quic-garvgupt Sep 24, 2024
Author

See #350 (comment), model effectiveness depends on the feature values having similar distributions during inference as they do during training, so if your code is vastly different from Fuchsia's, this is something to keep in mind.

I understand this and was attempting it myself. However, the codebases we are training our model on are very small, with only about 400 training samples after trace generation. Using the default hyperparameter values from the repository, I observed at best a 0.5% improvement. Even after tweaking the hyperparameters, the improvement did not exceed this value.

Since I am dealing with multiple small codebases, I tried combining their training by first training the model on one codebase and then using this as the warm-start policy for training on another codebase. However, I noticed that the reward distribution curve was mostly negative when training on successive codebases, which I interpreted as the model not having much to learn from. I am unsure if this is the correct way to combine training on multiple codebases.

Please let me know if you have any insights on combining the training process for multiple codebases or any other suggestions on my approach.

mtrofin · 2024-09-24T15:07:19Z

mtrofin
Sep 24, 2024
Maintainer

I observed at best a 0.5% improvement [...]

... which may be actually all that can be squeezed - for small projects that are already hyper-optimized for size, there's only so much headroom left.

Since I am dealing with multiple small codebases [...]

Try combining the corpora instead - i.e. from N small corpora (which you already extracted), you consolidate them all into one. Then do quantization ("vocab"), then training, on that combined one.

Another possibility is to use the ComPile database of IR modules. @boomanaiden154, does that come with a way to get a corpus.json? There are more nuances to discuss here - in fact, it'd be an interesting to explore methodology here: as a hypothesis, if we collected the vocab for each small corpus and measure the distance (Euclidian) between them; then how would a model trained on the whole ComPile compare to a model trained on the combined corpus; or combined + those elements in ComPile with features within the radius of the corpus... etc (lots of hand waviness here on my end)

Anyway, I'd suggest starting with combining the corpora from your project though :)

3 replies

boomanaiden154 Sep 24, 2024
Collaborator

It doesn't come with a way by default. It wouldn't be too difficult to write out all the .bc files to disk though and create a corpus_description.json. Maybe ~100 lines of Python to load the parquet files and dump all the contents.

quic-garvgupt Sep 24, 2024
Author

Try combining the corpora instead - i.e. from N small corpora (which you already extracted), you consolidate them all into one. Then do quantization ("vocab"), then training, on that combined one.

I have a basic question: If the output of vocabulary generation is not used for the final model training, what insights do we gain from this information? Additionally, since it is recommended to perform quantization before training when combining the corpus, will there be any changes in the training process to incorporate this in the final training step? Thanks for your prompt reply!

mtrofin Sep 24, 2024
Maintainer

If the output of vocabulary generation is not used for the final model training

It is used. E.g. see here.

quic-garvgupt · 2024-11-12T06:20:58Z

quic-garvgupt
Nov 12, 2024
Author

Thanks @mtrofin and @boomanaiden154 for answering all my queries promptly. I have no further questions and will close this thread.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Queries] Regarding usage of LLVM built with Pretrained Models and Development Mode #350

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 14 comments 5 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

[Queries] Regarding usage of LLVM built with Pretrained Models and Development Mode #350

quic-garvgupt Aug 13, 2024

Replies: 14 comments · 5 replies

boomanaiden154 Aug 13, 2024 Collaborator

mtrofin Aug 13, 2024 Maintainer

mtrofin Aug 13, 2024 Maintainer

quic-garvgupt Aug 14, 2024 Author

mtrofin Aug 14, 2024 Maintainer

quic-garvgupt Aug 19, 2024 Author

mtrofin Aug 19, 2024 Maintainer

quic-garvgupt Aug 21, 2024 Author

mtrofin Aug 21, 2024 Maintainer

quic-garvgupt Aug 27, 2024 Author

mtrofin Aug 27, 2024 Maintainer

quic-garvgupt Sep 23, 2024 Author

mtrofin Sep 23, 2024 Maintainer

quic-garvgupt Sep 24, 2024 Author

mtrofin Sep 24, 2024 Maintainer

boomanaiden154 Sep 24, 2024 Collaborator

quic-garvgupt Sep 24, 2024 Author

mtrofin Sep 24, 2024 Maintainer

quic-garvgupt Nov 12, 2024 Author

quic-garvgupt
Aug 13, 2024

Replies: 14 comments 5 replies

boomanaiden154
Aug 13, 2024
Collaborator

mtrofin
Aug 13, 2024
Maintainer

mtrofin
Aug 13, 2024
Maintainer

quic-garvgupt
Aug 14, 2024
Author

mtrofin
Aug 14, 2024
Maintainer

quic-garvgupt
Aug 19, 2024
Author

mtrofin
Aug 19, 2024
Maintainer

quic-garvgupt
Aug 21, 2024
Author

mtrofin
Aug 21, 2024
Maintainer

quic-garvgupt
Aug 27, 2024
Author

mtrofin
Aug 27, 2024
Maintainer

quic-garvgupt
Sep 23, 2024
Author

mtrofin Sep 23, 2024
Maintainer

quic-garvgupt Sep 24, 2024
Author

mtrofin
Sep 24, 2024
Maintainer

boomanaiden154 Sep 24, 2024
Collaborator

quic-garvgupt Sep 24, 2024
Author

mtrofin Sep 24, 2024
Maintainer

quic-garvgupt
Nov 12, 2024
Author