Huge AtomicInteger arrays #138
@CoreRasurae What is the maximum size of this array I can put in the local memory?
@gpeabody The maximum size of the array is defined by the local memory size, which typically ranges from 32 KiB to 48 KiB depending on the GPU. For 32-bit ints that is at most 8192 to 12288 entries, so 2 million entries certainly won't fit; however, they're supposed to be split across the workgroups.
The maximum size of the array in my case is 8192. I already use Kernel.setExplicit, but this makes absolutely no difference in the case of atomic arrays.
I definitely need to try this. But where can I find documentation and examples?
@gpeabody Essentially what can be achieved is importing the initial values from global memory while avoiding transferring the AtomicInteger array from the kernel back to Java at the end of kernel execution.
That's exactly what I'm doing. What about "helper methods to access the atomics without having any direct reference inside the Kernel.run() method"?
@gpeabody It is strange that kernel.setExplicit(true) makes no difference... If you run an Aparapi kernel that depends on non-atomic arrays only and set kernel.setExplicit(true), but you never call kernel.put(...) or kernel.get(...), does it still produce correct results?
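A minimal sketch of such a check (the kernel body and the size `N` are illustrative, not from this thread):

```java
import com.aparapi.Kernel;

final int N = 1024;
final int[] out = new int[N];
Kernel k = new Kernel() {
    @Override
    public void run() {
        out[getGlobalId()] = 1; // device-side write the host should not see without a get()
    }
};
k.setExplicit(true);
k.execute(N);
// If explicit mode is honored and k.get(out) is never called,
// out should still be all zeros on the host.
System.out.println(out[0]);
k.dispose();
```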
This is my test:
```
atomicInc (java.util.concurrent.atomic.AtomicInteger) in Kernel cannot be applied to (int)
```
@gpeabody Yes, sorry, I didn't test my example. What you need is:
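A sketch of the fix, inferred from the compiler error above (the array name `atomics` and the index `i` are placeholders for your kernel's own variables):

```java
// Kernel.atomicInc expects the AtomicInteger element itself, not an int index:
atomicInc(atomics[i]);   // correct: pass the AtomicInteger at position i
// atomicInc(i);         // wrong: atomicInc cannot be applied to (int)
```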
@gpeabody Regarding your current test kernel code, you will need to remove the method getAtomics() from the kernel and all calls to it.
```
Exception in thread "main" java.lang.ClassCastException: com.aparapi.internal.instruction.InstructionSet$I_ALOAD_1 cannot be cast to com.aparapi.internal.instruction.InstructionSet$AccessField
Process finished with exit code 1
```
@gpeabody What Aparapi version are you using?
```
#Created by Apache Maven 3.3.9
```
@gpeabody Can you try with the current git code in the master branch?
@gpeabody In the master branch there is new code to deal with Java bytecode analysis. Also, if you update to at least Aparapi 1.9.0, you can get extra execution performance on discrete GPUs by calling OpenCLDevice.setSharedMemory(false).
With 1.9.0:

```
Exception in thread "main" java.lang.ClassCastException: com.aparapi.internal.instruction.InstructionSet$I_ALOAD_1 cannot be cast to com.aparapi.internal.instruction.InstructionSet$AccessField
Process finished with exit code 1
```
@gpeabody Sure, Aparapi 1.9.0 will give the same issue with that Java bytecode, but you can improve the execution time of your original kernel just by using OpenCLDevice.setSharedMemory(false) on discrete GPUs.
How can I get 1.10.0?
@gpeabody Secondly, if you try the code in the git master branch and compile Aparapi, you will likely be able to run that modified kernel.
I cloned the 1.10.0 jar, but Maven does not recognize it. What should I change?
@gpeabody No, it won't recognize Aparapi like that. Aparapi 1.10.0 is not yet released, so it isn't in the Maven central repository. You will need to compile Aparapi 1.10.0-SNAPSHOT from git and then add the JARs to your local Maven repository. See the Maven documentation on how to manually add a jar to your local repository (mvn install:install-file ...).
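Something along these lines should work (the jar path is a placeholder for wherever your build put it):

```
mvn install:install-file \
  -Dfile=path/to/aparapi-1.10.0-SNAPSHOT.jar \
  -DgroupId=com.aparapi \
  -DartifactId=aparapi \
  -Dversion=1.10.0-SNAPSHOT \
  -Dpackaging=jar
```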
```
Exception in thread "main" java.lang.NoClassDefFoundError: com/aparapi/natives/NativeLoader
Process finished with exit code 1
```
There really is no NativeLoader in your repo. 1.9.0 also has no NativeLoader, but it doesn't give this issue.
@gpeabody You will also need to include aparapi-jni (https://github.com/Syncleus/aparapi-jni) in your pom.xml. You can grab it with something like this:
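(The version below is illustrative; use the aparapi-jni release that matches your Aparapi version.)

```xml
<dependency>
    <groupId>com.aparapi</groupId>
    <artifactId>aparapi-jni</artifactId>
    <!-- illustrative version; pick the release matching your Aparapi version -->
    <version>1.4.2</version>
</dependency>
```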
Ok, thank you very much for your patience!

```
Exception in thread "main" java.lang.ClassCastException: com.aparapi.internal.instruction.InstructionSet$I_ALOAD_1 cannot be cast to com.aparapi.internal.instruction.InstructionSet$AccessField
Process finished with exit code 1
```
@gpeabody Ok... We'll have to fix that...
@gpeabody Meanwhile, you can try your original version with Aparapi 1.10.0-SNAPSHOT and OpenCLDevice.setSharedMemory(false). It may improve the performance significantly in some cases.
Does the code that you showed above work correctly on your device?
I'll try OpenCLDevice.setSharedMemory(false), but I don't think that will greatly improve the situation.
@gpeabody I have similar code to that, but without the atomics, to reduce data transfers.
Why do you use this code for non-atomic data if you can use setExplicit? I tried to use setSharedMemory but got "non-static method cannot be referenced from a static context". It seems it must be called on an instance... How do I use it correctly?
@gpeabody It is true that one can use setExplicit(...); it is also normal to want to structure the code better than implementing everything in the Kernel.run() method. Currently that also has the side effect of bypassing Aparapi's automatic detection of variable usage (that is, whether variables are used for data input, output, or both), so it is an alternative way to achieve the same goal. And since you report that setExplicit(true) is still transferring the results back, which I find strange, that is something that will have to be checked when I find some time.

You can have a look at the unit/integration tests available in the Aparapi sources under the src/test/java folder; there are some examples there. As a hint, setSharedMemory(false) is to be called on the specific OpenCLDevice instance, the one that represents your GPU card, before calling kernel execute.
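A minimal sketch of that call pattern (assuming Device.best() resolves to your discrete GPU; `kernel` and `globalSize` stand in for your own code):

```java
import com.aparapi.Range;
import com.aparapi.device.Device;
import com.aparapi.device.OpenCLDevice;

Device dev = Device.best();
if (dev instanceof OpenCLDevice) {
    // instance method: call it on the device object, not on the class
    ((OpenCLDevice) dev).setSharedMemory(false);
}
Range range = dev.createRange(globalSize); // bind execution to this device
kernel.execute(range);
```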
As I understand after a variety of tests, setExplicit disables the transfer of all data except AtomicInteger[] arrays. If the AtomicInteger[] array is small, this is invisible. It becomes important only when the AtomicInteger[] array is large enough.
Results: Device isSharedMemory(): false / Device isSharedMemory(): true
Without AtomicInteger[]: Device isSharedMemory(): false / Device isSharedMemory(): true
@gpeabody I will try to fix the I_ALOAD_1 bytecode issue when passing AtomicInteger parameters, and I will have a look at the setExplicit(...) handling with atomic arrays when I find some time. Currently I have lots of work on other projects.

Anyway, keep in mind that atomics will always slow down the application a bit, because they involve more complex operations than a simple sum of two integers. Also, you will always have to pass the initial values from the AtomicInteger[] into the kernel, because there is no way in OpenCL 1.x to synchronize across all threads; it is only possible to synchronize within a local workgroup. Thus the only way for all threads to see the same initial values is to transfer the global atomic array to the GPU before starting the kernel. What you can save is the transfer of the atomic values from the GPU back to the host at the end of kernel execution.
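Sketched here with a plain int[] for illustration, since AtomicInteger[] support for this pattern is exactly what this issue is about (`kernel`, `values`, and `range` are placeholders):

```java
kernel.setExplicit(true);
kernel.put(values);    // one host -> GPU transfer of the initial values
kernel.execute(range);
// No kernel.get(values): the results stay on the GPU, saving the
// GPU -> host transfer at the end of each kernel execution.
```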
Can I initialize the AtomicInteger array once, transfer it to the GPU before the kernel is first started, and after that neither transfer it nor receive it back at all?
@gpeabody It should be possible, but only after fixing that setExplicit(...) handling for AtomicIntegers.
OK. Thank you for your understanding. I hope it does not take long.
@gpeabody I can't guarantee any time frame for this at the moment. If you need this soon, maybe you can have a look at the code yourself. The relevant code is in the com.aparapi.internal.kernel.KernelRunner class, in the methods private boolean prepareAtomicIntegerConversionBuffer(KernelArg arg) throws AparapiException and private void extractAtomicIntegerConversionBuffer(KernelArg arg) throws AparapiException.
If I change the following two lines in KernelRunner.java (lines 1040 and 1183), will it run correctly, or will it damage other logic?
@gpeabody Feel free to try. I believe that by itself it is not sufficient; you would also need to define Kernel.put(AtomicInteger[] arr) and Kernel.get(AtomicInteger[] arr) to ensure you can transfer the data to the kernel. You may also need some additional changes to ensure that, even if no transfer is made, memory is allocated on the GPU for the atomic array, which will be an int array in OpenCL. There's nothing wrong with trying small changes until it does what is needed. You can also run the validation tests to help verify that nothing else was broken.
I don't use these arrays on the host, only inside the GPU, so for me Kernel.get(AtomicInteger[] arr) is not necessary. But if the first initialization runs on the host, then I need Kernel.put(AtomicInteger[] arr), don't I? "if no transfer is made, memory is allocated in the GPU for the atomic array which will be an int array in OpenCL"
The only way of initializing/allocating GPU global memory in OpenCL is to transfer the data from the host. So yes, you will need kernel.put(AtomicInteger[] arr), or at least to transfer an empty array that will become associated with the AtomicInteger[].
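A hypothetical usage sketch, assuming the Kernel.put(AtomicInteger[]) overload discussed above gets added; it does not exist in released Aparapi at this point, and `kernel`, `range`, and `passes` are placeholders:

```java
import java.util.concurrent.atomic.AtomicInteger;

AtomicInteger[] atomics = new AtomicInteger[2_000_000];
for (int i = 0; i < atomics.length; i++) {
    atomics[i] = new AtomicInteger(0); // host-side initialization, done once
}
kernel.setExplicit(true);
kernel.put(atomics);           // hypothetical overload: one-time host -> GPU transfer
for (int pass = 0; pass < passes; pass++) {
    kernel.execute(range);     // the atomic array stays resident on the GPU
}
// kernel.get(atomics) is never called: the values are only used on the device
```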
Hello. It's me again. If I change [...] If I only use [...] What more can I do?
Now it does not compile at all. |
@gpeabody Aparapi 1.10.0 will be released soon. It will include #139, which fixes one of the issues you were having.
@CoreRasurae
Without atomics: Execution time: 346.024275
With atomicUpdate: Execution time: 444.818112
@CoreRasurae
I use 2 AtomicInteger arrays with 2 million indexes each. This costs about 200 milliseconds to prepare and extract these arrays.
Is it possible to avoid the preparation and extraction on each execution of the kernel?
I do not need to transfer these arrays to the host, only use them during execution.