Huge AtomicInteger arrays #138

Open
gpeabody opened this issue Jul 22, 2018 · 49 comments

Comments

@gpeabody

@CoreRasurae
I use two AtomicInteger arrays with 2 million entries each. Preparing and extracting these arrays takes about 200 milliseconds.
Is it possible to avoid the preparation and extraction on every kernel execution?
I do not need to transfer these arrays back to the host; I only use them during kernel execution.

@gpeabody Please avoid creating questions on closed topics; instead, open a new issue with your question.
Anyway, regarding your question, and without any clue about how you implemented the kernel, I would suggest placing the AtomicInteger arrays in local memory. That way they are only initialized inside the kernel and there is no transfer overhead.
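
A minimal sketch of the local-memory suggestion, assuming per-workgroup counters are enough (the 8192 size, field name and run() body are illustrative; @Local, atomicInc(AtomicInteger) and localBarrier() are existing Aparapi APIs):

    @Local
    AtomicInteger[] localCounters = new AtomicInteger[8192]; // lives in local memory, one copy per workgroup

    @Override public void run() {
        int lid = getLocalId();
        // work on this workgroup's counters only; no host transfer is involved
        atomicInc(localCounters[lid]);
        localBarrier(); // make the updates visible to the whole workgroup before reading them
    }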

@gpeabody
Author

@CoreRasurae What is the maximum size of the array I can put in local memory?
I've tried, but 2 million entries cause a memory overflow error.
I need to use global GPU memory (not host memory), since all threads work with these arrays, not just the threads within one workgroup.

@CoreRasurae
Collaborator

@gpeabody The maximum size of the array is defined by the local memory size, which typically ranges from 32 KiB to 48 KiB depending on the GPU. So 2 million entries certainly won't fit; besides, local memory is meant to be split across the workgroups.
If you need multiple workgroups with inter-workgroup atomic accesses, then yes, you can only use global-memory-based atomics. In that case there is no way to avoid the data transfers, but you can reduce them by either using explicit data transfers from and to the kernel, or by keeping Aparapi from detecting that you are writing to the atomic arrays. For the first case, you can put Aparapi into explicit transfer mode with Kernel.setExplicit(...); however, I believe Aparapi is missing some glue code to perform the transfer with Kernel.put(...) and Kernel.get(...). For the second case, you can use helper methods to access the atomics without having any direct reference inside the Kernel.run() method; that way Aparapi will believe that you are not changing the values in the array and treat it as input-only. You will always need to transfer the initial values to the kernel, but you don't need to transfer the results back from the kernel, which can save execution time.
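
A rough sketch of the first option (explicit transfer mode), assuming a kernel that owns a primitive buffer such as int[] data; setExplicit, put and get are the existing Kernel methods, the loop is illustrative:

    kernel.setExplicit(true);          // Aparapi no longer transfers buffers automatically
    kernel.put(data);                  // upload once, before the first execution
    for (int i = 0; i < iterations; i++) {
        kernel.execute(range);         // no automatic transfers between iterations
    }
    kernel.get(data);                  // download only when the host actually needs the results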

@gpeabody
Author

The maximum size of the array in my case is 8192
@Local
AtomicInteger[] atomics = new AtomicInteger[8192];
But this is not a solution, since only one workgroup works with an array in local memory.

I already use Kernel.setExplicit, but it makes absolutely no difference in the case of atomic arrays.

For the second case, you can use helper methods to access the atomics without having any direct reference inside the Kernel.run() method; that way Aparapi will believe that you are not changing the values in the array and treat it as input-only. You will always need to transfer the initial values to the kernel, but you don't need to transfer the results back from the kernel, which can save execution time.

I definitely need to try this. But where can I find documentation and examples?

@CoreRasurae
Collaborator

@gpeabody Essentially what can be achieved is importing the initial values from global memory while avoiding the transfer of the AtomicInteger array from the kernel back to Java at the end of kernel execution.
I have never tried to use global memory without having Aparapi transfer the initial data; I believe it is not supported. It could be feasible to initialize the memory inside the kernel, in a controlled manner, by using atomic OpenCL operations to set the atomics' initial values. You can try whether it works by using Kernel.setExplicit(true) while not calling kernel.put(...) or kernel.get(...) for the AtomicIntegers.

@gpeabody
Author

You can try whether it works by using Kernel.setExplicit(true) while not calling kernel.put(...) or kernel.get(...) for the AtomicIntegers.

That's exactly what I'm doing.

What about "helper methods to access the atomics without having any direct reference inside the Kernel.run() method"?

@CoreRasurae
Collaborator

@gpeabody It is strange that kernel.setExplicit(true) makes no difference... If you run an Aparapi kernel that depends only on non-atomic arrays, set kernel.setExplicit(true), and never call kernel.put(...) or kernel.get(...), does it still produce correct results?
Unfortunately there is no documentation for that, but I can give you a simple example.

int resultsArr[] = new int[200];
int atomicsArr[] = new int[200];

public int atomicUpdate(int arr[], int index) {
     //Other logic could be included to avoid having to call atomicUpdate, just to update an atomic
     return atomicInc(arr[index]);
}

public void run() {
    resultsArr[0] = 0; //Ensure resultsArr is touched for Aparapi to transfer the results back  
    atomicUpdate(atomicsArr, 1); //Modifier accesses to atomicArr are in helper function, so that Aparapi will believe atomicsArr is not modified

}

@gpeabody
Author

This is my test

import com.aparapi.Range;
import java.util.concurrent.atomic.AtomicInteger;

public class test2 {
    public static void main( String[] args )
    {
        final int size = 10000000;

        final float[] a = new float[size];
        final float[] b = new float[size];

        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }

        final float[] sum = new float[size];

        test2Kernel kernel = new test2Kernel(size, a, b);
        Range range = Range.create(size);

        kernel.setExplicit(true);
        kernel.put(a);
        kernel.put(b);

        for (int i = 0; i < 10; i++) {
            long t1 = System.currentTimeMillis();
            kernel.execute(range);
            long t2 = System.currentTimeMillis();

            System.out.println(t2-t1 + " : " + kernel.getExecutionTime());
        }

//        kernel.get(sum);
        AtomicInteger[] counters = kernel.getAtomics();

        System.out.println("Counter = " + String.valueOf(counters[0]));

        kernel.dispose();

    }
}
import com.aparapi.Kernel;
import java.util.concurrent.atomic.AtomicInteger;

public class test2Kernel extends Kernel {
    final int size;
    final float[] a;
    final float[] b;
    float[] sum;

    AtomicInteger[] atomics = new AtomicInteger[2000000];

    public test2Kernel(int _size, float[] _a, float[] _b) {
        size = _size;
        a = _a;
        b = _b;
        sum = new float[size];

        for (int i = 0; i < atomics.length; i++) {
            atomics[i] = new AtomicInteger(0);
        }
    }

    public AtomicInteger[] getAtomics() {
        return atomics;
    }

    @Override public void run() {
        int gid = getGlobalId();
        sum[gid] = a[gid] + b[gid];

        atomicInc(atomics[0]);
    }
}

@gpeabody
Author

gpeabody commented Jul 22, 2018

return atomicInc(arr[index]);

atomicInc (java.util.concurrent.atomic.AtomicInteger) in Kernel cannot be applied to (int)

@CoreRasurae
Collaborator

@gpeabody Yes, sorry, I didn't test my example. What you need is:

int resultsArr[] = new int[200];
AtomicInteger[] atomicsArr = new AtomicInteger[200];

public int atomicUpdate(AtomicInteger arr[], int index) {
     //Other logic could be included to avoid having to call atomicUpdate, just to update an atomic
     return atomicInc(arr[index]);
}

@CoreRasurae
Collaborator

@gpeabody Regarding your current test kernel code, you will need to remove method getAtomics() from the kernel and all calls to it.

@gpeabody
Author

    @Override public void run() {
        int gid = getGlobalId();
        sum[gid] = a[gid] + b[gid];

        atomicUpdate(atomics, 1);
    }

    public int atomicUpdate(AtomicInteger arr[], int index) {
        //Other logic could be included to avoid having to call atomicUpdate, just to update an atomic
        return atomicInc(arr[index]);
    }

Exception in thread "main" java.lang.ClassCastException: com.aparapi.internal.instruction.InstructionSet$I_ALOAD_1 cannot be cast to com.aparapi.internal.instruction.InstructionSet$AccessField
at com.aparapi.internal.writer.BlockWriter.getUltimateInstanceFieldAccess(BlockWriter.java:806)
at com.aparapi.internal.writer.BlockWriter.isMultiDimensionalArray(BlockWriter.java:791)
at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:464)
at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
at com.aparapi.internal.writer.KernelWriter.writeMethod(KernelWriter.java:306)
at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:647)
at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:638)
at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
at com.aparapi.internal.writer.BlockWriter.writeSequence(BlockWriter.java:299)
at com.aparapi.internal.writer.BlockWriter.writeBlock(BlockWriter.java:323)
at com.aparapi.internal.writer.BlockWriter.writeMethodBody(BlockWriter.java:848)
at com.aparapi.internal.writer.KernelWriter.write(KernelWriter.java:697)
at com.aparapi.internal.writer.KernelWriter.writeToString(KernelWriter.java:792)
at com.aparapi.internal.kernel.KernelRunner.executeInternalInner(KernelRunner.java:1503)
at com.aparapi.internal.kernel.KernelRunner.executeInternalOuter(KernelRunner.java:1351)
at com.aparapi.internal.kernel.KernelRunner.execute(KernelRunner.java:1342)
at com.aparapi.Kernel.execute(Kernel.java:2856)
at com.aparapi.Kernel.execute(Kernel.java:2813)
at com.aparapi.Kernel.execute(Kernel.java:2753)
at test2.main(test2.java:29)

Process finished with exit code 1

@CoreRasurae
Collaborator

@gpeabody What Aparapi version are you using?

@gpeabody
Author

#Created by Apache Maven 3.3.9
version=1.8.0
groupId=com.aparapi
artifactId=aparapi

@CoreRasurae
Collaborator

@gpeabody Can you try with the current git code in the master branch?

@CoreRasurae
Collaborator

@gpeabody The master branch contains new code for Java bytecode analysis. Also, if you update to at least Aparapi 1.9.0, you can get extra execution performance on discrete GPUs by calling OpenCLDevice.setSharedMemory(false).

@gpeabody
Author

1.9.0
the same...

Exception in thread "main" java.lang.ClassCastException: com.aparapi.internal.instruction.InstructionSet$I_ALOAD_1 cannot be cast to com.aparapi.internal.instruction.InstructionSet$AccessField
at com.aparapi.internal.writer.BlockWriter.getUltimateInstanceFieldAccess(BlockWriter.java:806)
at com.aparapi.internal.writer.BlockWriter.isMultiDimensionalArray(BlockWriter.java:791)
at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:464)
at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
at com.aparapi.internal.writer.KernelWriter.writeMethod(KernelWriter.java:306)
at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:647)
at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:638)
at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
at com.aparapi.internal.writer.BlockWriter.writeSequence(BlockWriter.java:299)
at com.aparapi.internal.writer.BlockWriter.writeBlock(BlockWriter.java:323)
at com.aparapi.internal.writer.BlockWriter.writeMethodBody(BlockWriter.java:848)
at com.aparapi.internal.writer.KernelWriter.write(KernelWriter.java:697)
at com.aparapi.internal.writer.KernelWriter.writeToString(KernelWriter.java:792)
at com.aparapi.internal.kernel.KernelRunner.executeInternalInner(KernelRunner.java:1503)
at com.aparapi.internal.kernel.KernelRunner.executeInternalOuter(KernelRunner.java:1351)
at com.aparapi.internal.kernel.KernelRunner.execute(KernelRunner.java:1342)
at com.aparapi.Kernel.execute(Kernel.java:2857)
at com.aparapi.Kernel.execute(Kernel.java:2814)
at com.aparapi.Kernel.execute(Kernel.java:2754)
at test2.main(test2.java:29)

Process finished with exit code 1

@gpeabody
Author

import com.aparapi.Kernel;
import java.util.concurrent.atomic.AtomicInteger;

public class test2Kernel extends Kernel {
    final int size;
    final float[] a;
    final float[] b;
    float[] sum;

    AtomicInteger[] atomics = new AtomicInteger[2000000];

    public test2Kernel(int _size, float[] _a, float[] _b) {
        size = _size;
        a = _a;
        b = _b;
        sum = new float[size];

        for (int i = 0; i < atomics.length; i++) {
            atomics[i] = new AtomicInteger(0);
        }
    }

//    public AtomicInteger[] getAtomics() {
//        return atomics;
//    }

    @Override public void run() {
        int gid = getGlobalId();
        sum[gid] = a[gid] + b[gid];

        atomicUpdate(atomics, 1);
    }

    public int atomicUpdate(AtomicInteger arr[], int index) {
        //Other logic could be included to avoid having to call atomicUpdate, just to update an atomic
        return atomicInc(arr[index]);
    }
}

@CoreRasurae
Collaborator

@gpeabody Sure, Aparapi 1.9.0 will give the same issue with that Java bytecode, but you can improve the execution time of your original kernel just by using OpenCLDevice.setSharedMemory(false) on discrete GPUs.

@gpeabody
Author

gpeabody commented Jul 22, 2018

How can I get 1.10.0?
Does 1.10.0 not have this error?

@CoreRasurae
Collaborator

@gpeabody Secondly, if you try the code in the git master branch and compile Aparapi, you will likely be able to run that modified kernel.
git clone https://github.com/Syncleus/aparapi -b master --single-branch
Then build with
mvn package

@CoreRasurae
Collaborator

@gpeabody 1.10.0 is yet to be released. @freemo Is there a release date for this?

@gpeabody
Author

I cloned and built the 1.10.0 jar, but Maven does not recognize it. What should I change?

    <dependencies>
        <dependency>
            <groupId>com.aparapi</groupId>
            <artifactId>aparapi</artifactId>
            <version>1.10.0</version>
        </dependency>
    </dependencies>

@CoreRasurae
Collaborator

@gpeabody No, it won't recognize Aparapi like that. Aparapi 1.10.0 is not yet released, so it isn't on the Maven Central repository. You will need to compile Aparapi 1.10.0-SNAPSHOT from git and then add the JARs to your local Maven repository.
Then you will need to change your pom.xml to point to 1.10.0-SNAPSHOT.

See the Maven documentation on how to manually add a jar to your local Maven repository (mvn install:install-file ...).
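
For reference, the manual install typically looks like this (the jar path assumes the 1.10.0-SNAPSHOT artifact produced by mvn package; adjust it to your build output):

mvn install:install-file -Dfile=target/aparapi-1.10.0-SNAPSHOT.jar -DgroupId=com.aparapi -DartifactId=aparapi -Dversion=1.10.0-SNAPSHOT -Dpackaging=jar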

@gpeabody
Author

Exception in thread "main" java.lang.NoClassDefFoundError: com/aparapi/natives/NativeLoader
at com.aparapi.internal.opencl.OpenCLLoader.(OpenCLLoader.java:43)
at com.aparapi.internal.opencl.OpenCLPlatform.getOpenCLPlatforms(OpenCLPlatform.java:73)
at com.aparapi.device.OpenCLDevice.listDevices(OpenCLDevice.java:517)
at com.aparapi.internal.kernel.KernelManager.createDefaultPreferredDevices(KernelManager.java:212)
at com.aparapi.internal.kernel.KernelManager.createDefaultPreferences(KernelManager.java:187)
at com.aparapi.internal.kernel.KernelManager.setup(KernelManager.java:55)
at com.aparapi.internal.kernel.KernelManager.(KernelManager.java:46)
at com.aparapi.internal.kernel.KernelManager.(KernelManager.java:38)
at com.aparapi.internal.kernel.KernelRunner.(KernelRunner.java:188)
at com.aparapi.Kernel.prepareKernelRunner(Kernel.java:2537)
at com.aparapi.Kernel.setExplicit(Kernel.java:3162)
at test2.main(test2.java:23)
Caused by: java.lang.ClassNotFoundException: com.aparapi.natives.NativeLoader
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more

Process finished with exit code 1

@gpeabody
Author

gpeabody commented Jul 22, 2018

there is really no NativeLoader in your repo:

import com.aparapi.natives.NativeLoader;

1.9.0 also has no NativeLoader, but it doesn't produce this issue

@CoreRasurae
Collaborator

CoreRasurae commented Jul 22, 2018

@gpeabody You will also need to include aparapi-jni (https://github.com/Syncleus/aparapi-jni) in your pom.xml. You can grab it with this:

<!-- https://mvnrepository.com/artifact/com.aparapi/aparapi-jni -->
<dependency>
    <groupId>com.aparapi</groupId>
    <artifactId>aparapi-jni</artifactId>
    <version>1.4.1</version>
</dependency>

@gpeabody
Author

Ok, thank you very much for your patience!
Back to the beginning ))

Exception in thread "main" java.lang.ClassCastException: com.aparapi.internal.instruction.InstructionSet$I_ALOAD_1 cannot be cast to com.aparapi.internal.instruction.InstructionSet$AccessField
at com.aparapi.internal.writer.BlockWriter.getUltimateInstanceFieldAccess(BlockWriter.java:808)
at com.aparapi.internal.writer.BlockWriter.isMultiDimensionalArray(BlockWriter.java:793)
at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:464)
at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
at com.aparapi.internal.writer.KernelWriter.writeMethod(KernelWriter.java:306)
at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:647)
at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:638)
at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
at com.aparapi.internal.writer.BlockWriter.writeSequence(BlockWriter.java:299)
at com.aparapi.internal.writer.BlockWriter.writeBlock(BlockWriter.java:323)
at com.aparapi.internal.writer.BlockWriter.writeMethodBody(BlockWriter.java:850)
at com.aparapi.internal.writer.KernelWriter.write(KernelWriter.java:697)
at com.aparapi.internal.writer.KernelWriter.writeToString(KernelWriter.java:792)
at com.aparapi.internal.kernel.KernelRunner.executeInternalInner(KernelRunner.java:1535)
at com.aparapi.internal.kernel.KernelRunner.executeInternalOuter(KernelRunner.java:1383)
at com.aparapi.internal.kernel.KernelRunner.execute(KernelRunner.java:1374)
at com.aparapi.Kernel.execute(Kernel.java:2897)
at com.aparapi.Kernel.execute(Kernel.java:2854)
at com.aparapi.Kernel.execute(Kernel.java:2794)
at test2.main(test2.java:29)

Process finished with exit code 1

@CoreRasurae
Collaborator

@gpeabody Ok... We'll have to fix that...

@CoreRasurae
Collaborator

@gpeabody Meanwhile, you can try your original version with Aparapi 1.10.0-SNAPSHOT and OpenCLDevice.setSharedMemory(false). It may improve the performance significantly in some cases.

@gpeabody
Author

Does the code that you showed above work correctly on your device?

int resultsArr[] = new int[200];
AtomicInteger[] atomicsArr = new AtomicInteger[200];

public int atomicUpdate(AtomicInteger arr[], int index) {
     //Other logic could be included to avoid having to call atomicUpdate, just to update an atomic
     return atomicInc(arr[index]);
}

public void run() {
    resultsArr[0] = 0; //Ensure resultsArr is touched for Aparapi to transfer the results back  
    atomicUpdate(atomicsArr, 1); //Modifier accesses to atomicArr are in helper function, so that Aparapi will believe atomicsArr is not modified
}

I'll try OpenCLDevice.setSharedMemory(false), but I don't think it will greatly improve the situation.
My kernel execution takes 20 milliseconds without the AtomicInteger[] and 200 milliseconds with it.

@CoreRasurae
Collaborator

@gpeabody I have similar code to that, but without the atomics, to reduce data transfers.
Regarding setSharedMemory(...): atomics will always delay execution a bit, but more importantly, discrete GPUs don't share their memory with the host, so global memory accesses will have to be made through the PCIe bus, which greatly increases latency and thus slows down kernel execution. All your AtomicInteger arrays are in global memory.

@gpeabody
Author

Why do you use this code for non-atomic data if you can use setExplicit?

I tried to use setSharedMemory but got "non-static method cannot be referenced from a static context". It seems it must be called on an instance... How do I use it correctly?

@CoreRasurae
Collaborator

@gpeabody It is true that one can use setExplicit(...). It is also normal to want to structure the code better than just implementing everything in the kernel.run() method, and currently that also has the side effect of bypassing Aparapi's automatic detection of variable usage (that is, whether variables are used for data input, output, or both). So it is an alternative way to achieve the same thing. As for your report that setExplicit(true) still transfers the results back, I find that strange... it is something I will have to check when I find some time.

You can have a look at the unit/integration tests available in the Aparapi sources, in the src/test/java folder; there are some examples there. As a hint, setSharedMemory(false) is to be called on the specific OpenCLDevice instance, the one that represents your GPU card, before calling kernel execute.
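
A minimal sketch of that call sequence, assuming Aparapi 1.9.0+ and a discrete GPU (Device and OpenCLDevice are in com.aparapi.device; picking the device from OpenCLDevice.listDevices(...) would work as well; size and kernel are illustrative):

    Device device = Device.best();
    if (device instanceof OpenCLDevice) {
        ((OpenCLDevice) device).setSharedMemory(false); // discrete GPU: memory is not shared with the host
    }
    Range range = Range.create(device, size); // bind the range to that device
    kernel.execute(range);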

@gpeabody
Author

As I understand it, after a variety of tests, setExplicit disables the transfer of all data except the AtomicInteger[] arrays. If the AtomicInteger[] array is small, this is invisible; it only becomes important when the AtomicInteger[] array is large enough.

@gpeabody
Author

results:

Device isShareMempory(): false
...Device name: Tahiti, Id: 3143440
Device isShareMempory(): false
...Device name: AMD FX(tm)-6100 Six-Core Processor , Id: 520386736
Execute time: 562.142399
Execute time: 100.888389
Execute time: 80.919264
Execute time: 81.842185
Execute time: 81.003895
Execute time: 80.972699
Execute time: 82.700549
Execute time: 81.55493
Execute time: 80.306763
Execute time: 81.585818

Device isShareMempory(): true
...Device name: Tahiti, Id: 4714800
Device isShareMempory(): true
...Device name: AMD FX(tm)-6100 Six-Core Processor , Id: 522310928
Execute time: 558.12825
Execute time: 101.372396
Execute time: 88.441623
Execute time: 87.150831
Execute time: 84.234118
Execute time: 81.801722
Execute time: 82.307351
Execute time: 82.104111
Execute time: 82.955064
Execute time: 82.216233

@gpeabody
Author

gpeabody commented Jul 23, 2018

without AtomicInteger[]:

Device isShareMemory(): false
...Device name: Tahiti, Id: 3143440
Device isShareMemory(): false
...Device name: AMD FX(tm)-6100 Six-Core Processor , Id: 521107632
Execution time: 449.848156
Execution time: 18.20946
Execution time: 18.171468
Execution time: 18.020428
Execution time: 18.000968
Execution time: 18.050079
Execution time: 18.031239
Execution time: 18.105677
Execution time: 18.128225
Execution time: 18.144905

Device isShareMemory(): true
...Device name: Tahiti, Id: 2422544
Device isShareMemory(): true
...Device name: AMD FX(tm)-6100 Six-Core Processor , Id: 522877104
Execution time: 456.506282
Execution time: 18.903812
Execution time: 18.929141
Execution time: 20.588421
Execution time: 18.959719
Execution time: 18.14058
Execution time: 18.449456
Execution time: 18.148302
Execution time: 18.225212
Execution time: 18.899797

@CoreRasurae
Collaborator

@gpeabody I will try to fix the I_ALOAD_1 bytecode issue when passing AtomicInteger parameters, and will have a look at the setExplicit(...) handling with atomic arrays when I find some time. Currently I have lots of work on other projects.

Anyway, keep in mind that atomics will always slow down the application a bit, because they involve more complex operations than a simple sum of two integers. Also, you will always have to pass the initial values from the AtomicInteger[] into the kernel, because there is no way in OpenCL 1.x to synchronize across all threads; it is only possible to synchronize within the local workgroup, so the only way for all threads to see the same initial values is to transfer the global atomic array to the GPU before starting the kernel. What you can save is the transfer of the atomic values from the GPU back to the host at the end of kernel execution.
Also note that transferring arrays of AtomicInteger will always be slower, because they are non-primitive types and have to be handled in a special way.

@gpeabody
Author

Can I initialize the AtomicInteger array once, transfer it to the GPU before the kernel is first started, and after that neither transfer it again nor receive it back at all?
I only need it to work in global GPU memory.

@CoreRasurae
Collaborator

@gpeabody It should be possible, but only after that setExplicit(...) handling for AtomicIntegers is fixed.

@gpeabody
Author

OK. Thank you for your understanding. I hope it does not take long.

@CoreRasurae
Collaborator

@gpeabody I can't guarantee any time frame for this at the moment; if you need this soon, maybe you can have a look at the code yourself. The relevant code is in the com.aparapi.internal.kernel.KernelRunner class, in the methods private boolean prepareAtomicIntegerConversionBuffer(KernelArg arg) throws AparapiException and private void extractAtomicIntegerConversionBuffer(KernelArg arg) throws AparapiException.
You can propose a fix.

@gpeabody
Author

If I change the two following lines in KernelRunner.java, will it run correctly, or will it damage other logic?

Line 1040:
if (!explicit) extractAtomicIntegerConversionBuffer(arg);

Line 1183:
if (!explicit) prepareAtomicIntegerConversionBuffer(arg);

@CoreRasurae
Collaborator

@gpeabody Feel free to try. I believe that, by itself, is not sufficient; you would also need to define Kernel.put(AtomicInteger[] arr) and Kernel.get(AtomicInteger[] arr) to ensure you can transfer the data to the kernel. You may also need some additional changes to ensure that, even if no transfer is made, memory is allocated in the GPU for the atomic array which will be an int array in OpenCL. There's nothing wrong with trying small changes until it does what is needed. You can also run the validation tests to help verify that nothing else was broken.

@gpeabody
Author

I don't use these arrays on the host, only inside the GPU. For me, Kernel.get(AtomicInteger[] arr) is not necessary. But if the first initialization runs on the host, then I need Kernel.put(AtomicInteger[] arr), don't I?

"if no transfer is made, memory is allocated in the GPU for the atomic array which will be an int array in OpenCL"
How can I be sure that memory is allocated on the GPU?

@CoreRasurae
Collaborator

The only way of initializing/allocating GPU global memory in OpenCL is to transfer the data from the host, so yes, you will need kernel.put(AtomicInteger[] arr), or at least transfer an empty array that will become associated with the AtomicInteger[].
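
If such put/get support for AtomicInteger[] is added, the intended usage would presumably look like this (hypothetical: current Aparapi releases do not accept AtomicInteger[] in kernel.put/kernel.get, as discussed above):

    kernel.setExplicit(true);
    kernel.put(atomics);             // hypothetical one-time upload of the AtomicInteger[] initial values
    for (int i = 0; i < runs; i++) {
        kernel.execute(range);       // no per-run atomic-array transfers
    }
    // kernel.get(atomics) is never called -- the results stay on the GPU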

@gpeabody
Author

gpeabody commented Aug 1, 2018

Hello. It's me again.
I cannot get the two changes I wrote above to work.

If I change
if (!explicit) prepareAtomicIntegerConversionBuffer(arg);
the build does not pass the tests and I can't get the jar file.

If I only change
if (!explicit) extractAtomicIntegerConversionBuffer(arg);
the build goes through. My test runs twice as fast, but the AtomicInteger does not work: atomicInc(atomics[0]); gives 0 in the end.

What more can I do?

@gpeabody
Author

gpeabody commented Aug 2, 2018

Now it does not compile at all.
I did a git clone again, but it still does not compile.

@CoreRasurae
Collaborator

@gpeabody Aparapi 1.10.0 will be released soon; it will include #139, which fixes one of the issues you were having:

Exception in thread "main" java.lang.ClassCastException: com.aparapi.internal.instruction.InstructionSet$I_ALOAD_1 cannot be cast to com.aparapi.internal.instruction.InstructionSet$AccessField

@gpeabody
Author

@CoreRasurae
Hello
I've installed version 1.10.1.
No more I_ALOAD_1 error, thank you.
But the kernel running time is the same as before. It does not matter whether I use atomicInc inside the run() method or inside atomicUpdate(): it takes 66 milliseconds, against 15 milliseconds with no atomics.

Without atomics:

Execution time: 346.024275
Execution time: 17.025696
Execution time: 16.60802
Execution time: 16.362656
Execution time: 15.769028
Execution time: 15.703015
Execution time: 16.618698
Execution time: 18.433077
Execution time: 17.662524
Execution time: 17.416675

With atomicUpdate:

Execution time: 444.818112
Execution time: 77.709088
Execution time: 66.59614
Execution time: 66.905817
Execution time: 67.165014
Execution time: 66.602935
Execution time: 66.466299
Execution time: 68.917506
Execution time: 67.030319
Execution time: 68.720196
