The main-objective of this fork is to make Rootbeer thread-safe in order to use it from Spark, see Known Bugs. It also adds some smaller bugfixes, because the original development seems to be halted (2016)
In order to do that this fork adds quite some code comments, reduces and simplifies code and also changes the code style of sighted files to something more similar to e.g. imresh i.e. braces on new line, 4 spaces instead of 2 indentation and alignment of similar and especially of boiler-plate code.
The reason behind this is trying to understand and get a feel for the code.
The Rootbeer GPU Compiler lets you use GPUs from within Java. It allows you to use almost anything from Java on the GPU:
- Composite objects with methods and fields
- Static and instance methods and fields
- Arrays of primitive and reference types of any dimension.
ROOTBEER IS PRE-PRODUCTION BETA. IF ROOTBEER WORKS FOR YOU, PLEASE LET ME KNOW AT PCPRATTS@TRIFORT.ORG
Be aware that you should not expect to get a speedup using a GPU by doing something simple like multiplying each element in an array by a scalar. Serialization time is a large bottleneck and usually you need an algorithm that is O(n^2) to O(n^3) per O(n) elements of data.
GPU PROGRAMMING IS NOT EASY, EVEN WITH ROOTBEER. EXPECT TO SPEND A MONTH OPTIMIZING TRIVIAL EXAMPLES.
FEEL FREE TO EMAIL ME FOR DISCUSSIONS BEFORE ATTEMPTING TO USE ROOTBEER
An experienced GPU developer will look at existing code and find places where control can be transfered to the GPU. Optimal performance in an application will have places with serial code and places with parallel code on the GPU. At each place that a cut can be made to transfer control to the GPU, the job needs to be sized for the GPU.
For the best performance, you should be using shared memory (NVIDIA term). The shared memory is basically a software managed cache. You want to have more threads per block, but this often requires using more shared memory. If you see the CUDA Occupancy Calculator you can see that for best occupancy you will want more threads and less shared memory. There is a tradeoff between thread count, shared memory size and register count. All of these are configurable using Rootbeer.
Kernel Interface: Your code that will run on the GPU will implement the Kernel interface.
You send data to the gpu by adding a field to the object implementing kernel. gpuMethod
will access the data.
package org.trifort.rootbeer.runtime;
public interface Kernel {
void gpuMethod();
}
This simple example uses kernel lists and no thread config or context. Rootbeer will create a thread config and select the best device automatically. If you wish to use multiple GPUs you need to pass in a Context.
ScalarAddApp.java: See the example
package org.trifort.rootbeer.examples.scalaradd;
import java.util.List;
import java.util.ArrayList;
import org.trifort.rootbeer.runtime.Kernel;
import org.trifort.rootbeer.runtime.Rootbeer;
import org.trifort.rootbeer.runtime.util.Stopwatch;
public class ScalarAddApp {
public void multArray(int[] array){
List<Kernel> tasks = new ArrayList<Kernel>();
for(int index = 0; index < array.length; ++index){
tasks.add(new ScalarAddKernel(array, index));
}
Rootbeer rootbeer = new Rootbeer();
rootbeer.run(tasks);
}
private void printArray(String message, int[] array){
for(int i = 0; i < array.length; ++i){
System.out.println(message+" array["+i+"]: "+array[i]);
}
}
public static void main(String[] args){
ScalarAddApp app = new ScalarAddApp();
int length = 10;
int[] array = new int[length];
for(int index = 0; index < array.length; ++index){
array[index] = index;
}
app.printArray("start", array);
app.multArray(array);
app.printArray("end", array);
}
}
ScalarAddKernel:
package org.trifort.rootbeer.examples.scalaradd;
import org.trifort.rootbeer.runtime.Kernel;
public class ScalarAddKernel implements Kernel {
private int[] array;
private int index;
public ScalarAddKernel(int[] array, int index){
this.array = array;
this.index = index;
}
public void gpuMethod(){
array[index] += 1;
}
}
See the example See the slides
GPUSort.java
package org.trifort.rootbeer.sort;
import org.trifort.rootbeer.runtime.Rootbeer;
import org.trifort.rootbeer.runtime.GpuDevice;
import org.trifort.rootbeer.runtime.Context;
import org.trifort.rootbeer.runtime.ThreadConfig;
import org.trifort.rootbeer.runtime.StatsRow;
import org.trifort.rootbeer.runtime.CacheConfig;
import java.util.List;
import java.util.Arrays;
import java.util.Random;
public class GPUSort {
private int[] newArray(int size){
int[] ret = new int[size];
for(int i = 0; i < size; ++i){
ret[i] = i;
}
return ret;
}
public void checkSorted(int[] array, int outerIndex){
for(int index = 0; index < array.length; ++index){
if(array[index] != index){
for(int index2 = 0; index2 < array.length; ++index2){
System.out.println("array["+index2+"]: "+array[index2]);
}
throw new RuntimeException("not sorted: "+outerIndex);
}
}
}
public void fisherYates(int[] array)
{
Random random = new Random();
for (int i = array.length - 1; i > 0; i--){
int index = random.nextInt(i + 1);
int a = array[index];
array[index] = array[i];
array[i] = a;
}
}
public void sort(){
//should have at least 192 threads per SM
int size = 2048;
int sizeBy2 = size / 2;
//int numMultiProcessors = 14;
//int blocksPerMultiProcessor = 512;
int numMultiProcessors = 2;
int blocksPerMultiProcessor = 256;
int outerCount = numMultiProcessors*blocksPerMultiProcessor;
int[][] array = new int[outerCount][];
for(int i = 0; i < outerCount; ++i){
array[i] = newArray(size);
}
Rootbeer rootbeer = new Rootbeer();
List<GpuDevice> devices = rootbeer.getDevices();
GpuDevice device0 = devices.get(0);
//create a context with 4212880 bytes objectMemory.
//you can leave the 4212880 missing at first to
//use all available GPU memory. after you run you
//can call context0.getRequiredMemory() to see
//what value to enter here
Context context0 = device0.createContext(4212880);
//use more die area for shared memory instead of
//cache. the shared memory is a software defined
//cache that, if programmed properly, can perform
//better than the hardware cache
//see (CUDA Occupancy calculator)[http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls]
context0.setCacheConfig(CacheConfig.PREFER_SHARED);
//wire thread config for throughput mode. after
//calling buildState, the book-keeping information
//will be cached in the JNI driver
context0.setThreadConfig(sizeBy2, outerCount, outerCount * sizeBy2);
//configure to use kernel templates. rather than
//using kernel lists where each thread has a Kernel
//object, there is only one kernel object (less memory copies)
//when using kernel templates you need to differetiate
//your data using thread/block indexes
context0.setKernel(new GPUSortKernel(array));
//cache the state and get ready for throughput mode
context0.buildState();
while(true){
//randomize the array to be sorted
for(int i = 0; i < outerCount; ++i){
fisherYates(array[i]);
}
long gpuStart = System.currentTimeMillis();
//run the cached throughput mode state.
//the data now reachable from the only
//GPUSortKernel is serialized to the GPU
context0.run();
long gpuStop = System.currentTimeMillis();
long gpuTime = gpuStop - gpuStart;
StatsRow row0 = context0.getStats();
System.out.println("serialization_time: "+row0.getSerializationTime());
System.out.println("execution_time: "+row0.getExecutionTime());
System.out.println("deserialization_time: "+row0.getDeserializationTime());
System.out.println("gpu_required_memory: "+context0.getRequiredMemory());
System.out.println("gpu_time: "+gpuTime);
for(int i = 0; i < outerCount; ++i){
checkSorted(array[i], i);
fisherYates(array[i]);
}
long cpuStart = System.currentTimeMillis();
for(int i = 0; i < outerCount; ++i){
Arrays.sort(array[i]);
}
long cpuStop = System.currentTimeMillis();
long cpuTime = cpuStop - cpuStart;
System.out.println("cpu_time: "+cpuTime);
double ratio = (double) cpuTime / (double) gpuTime;
System.out.println("ratio: "+ratio);
}
//context0.close();
}
public static void main(String[] args){
GPUSort sorter = new GPUSort();
while(true){
sorter.sort();
}
}
}
GPUSortKernel.java
package org.trifort.rootbeer.sort;
import org.trifort.rootbeer.runtime.Kernel;
import org.trifort.rootbeer.runtime.RootbeerGpu;
public class GPUSortKernel implements Kernel {
private int[][] arrays;
public GPUSortKernel(int[][] arrays){
this.arrays = arrays;
}
@Override
public void gpuMethod(){
int[] array = arrays[RootbeerGpu.getBlockIdxx()];
int index1a = RootbeerGpu.getThreadIdxx() << 1;
int index1b = index1a + 1;
int index2a = index1a - 1;
int index2b = index1a;
int index1a_shared = index1a << 2;
int index1b_shared = index1b << 2;
int index2a_shared = index2a << 2;
int index2b_shared = index2b << 2;
RootbeerGpu.setSharedInteger(index1a_shared, array[index1a]);
RootbeerGpu.setSharedInteger(index1b_shared, array[index1b]);
//outer pass
int arrayLength = array.length >> 1;
for(int i = 0; i < arrayLength; ++i){
int value1 = RootbeerGpu.getSharedInteger(index1a_shared);
int value2 = RootbeerGpu.getSharedInteger(index1b_shared);
int shared_value = value1;
if(value2 < value1){
shared_value = value2;
RootbeerGpu.setSharedInteger(index1a_shared, value2);
RootbeerGpu.setSharedInteger(index1b_shared, value1);
}
RootbeerGpu.syncthreads();
if(index2a >= 0){
value1 = RootbeerGpu.getSharedInteger(index2a_shared);
//value2 = RootbeerGpu.getSharedInteger(index2b_shared);
value2 = shared_value;
if(value2 < value1){
RootbeerGpu.setSharedInteger(index2a_shared, value2);
RootbeerGpu.setSharedInteger(index2b_shared, value1);
}
}
RootbeerGpu.syncthreads();
}
array[index1a] = RootbeerGpu.getSharedInteger(index1a_shared);
array[index1b] = RootbeerGpu.getSharedInteger(index1b_shared);
}
}
-
Download the latest Rootbeer.jar from the releases
-
Program using the Kernel, Rootbeer, GpuDevice and Context class.
-
Compile your program normally with javac.
-
Pack all the classes used into a single jar using pack
-
Compile with Rootbeer to enable the GPU
java -Xmx8g -jar Rootbeer.jar App-GPU-compiled.jar App-GPU.jar zipmerge App.jar Rootbeer.jar App-GPU-compiled.jar
-
java -jar App.jar
All together:
( cd csrc && ./compile_linux_x64 ) && ant jar && ./pack-rootbeer
To compile the CountKernel example:
( cd ../.. && ( cd csrc && ./compile_linux_x64 ) && ant clean && 'rm' -f dist/Rootbeer1.jar Rootbeer.jar && ant jar && ./pack-rootbeer ) && make clean && make -B && java -jar Count.jar
-
Clone the github repo to
rootbeer1/
-
cd rootbeer1/
-
If JNI source-code was changed, then it is necessary to recompile the normally pre-compiled binaries:
cd csrc ./compile_linux_x86 ./compile_linux_x64 ./compile_win_x86 ./compile_win_x64 ./compile_mac
-
ant jar
-
./pack-rootbeer
(linux) or./pack-rootbeer.bat
(windows) -
Use the
Rootbeer.jar
(notdist/Rootbeer1.jar
)
To compile a single jar with Rootbeer you can do:
Rootbeer.jar [Options] <mainjar> <destjar>
Option | Description |
---|---|
-printdeviceinfo |
print out information regarding your GPU |
-noarraychecks |
remove array out of bounds checks once you get your application to work |
-nodoubles |
you are telling rootbeer that there are no doubles and we can compile with older versions of CUDA |
-norecursion |
you are telling rootbeer that there are no recursions and we can compile with older versions of CUDA |
-noexceptions |
remove exception checking on GPUs. If not, then exceptions thrown on GPU will be serialized back to the host and then thrown from the Host |
-nemu |
use CPU emulator |
-jemu |
use CPU emulator |
-keepmains |
keep main methods |
-maxrregcount <number of registers per thread> |
sent to CUDA compiler to limit register count |
-shared-mem-size <size> |
specify the shared memory size |
-computecapability <architecture> |
specify the Compute Capability: sm_11 , sm_12 , sm_20 , sm_21 , sm_30 , sm_35 (default ALL) |
-32bit |
compile with 32bit |
-64bit |
compile with 64bit (if you are on a 64bit machine you will want to use just this) |
-remap-sparse |
|
-disable-class-remapping |
|
-manualcuda <path> |
Sets source file specified thereafter for compilation |
-runtests |
run all available tests but not the large memory tests |
-runeasytests |
run test suite to see if things are working |
-runtest <name> |
run specific test case |
-large-mem-tests |
run tests for large memory |
Once you get started, you will find you want to use a combination of -maxregcount
, -shared-mem-size
and the thread count sent to the GPU to control occupancy.
This syntax is not yet implemented fully yet, please use the simple syntax above!
Rootbeer.jar [Options]
Additional options available for non-simple syntax:
Option | Description |
---|---|
-mainjar <path> |
The jar which will be parsed and compiled by Rootbeer |
-libjar <path> |
Can be specified multiple times. |
-directory <path> |
Can be specified multiple times. |
-destjar <path> |
The finished jar will be written to this file |
You can use System.out.println in a limited way while on the GPU. Printing in Java requires StringBuilder support to concatenate strings/integers/etc. Rootbeer has a custom StringBuilder runtime (written with great improvements from Martin Illecker) that allows most normal printlns to work.
Since you are running on a parallel GPU, it is nice to print from a single thread
public void gpuMethod(){
if(RootbeerGpu.getThreadIdxx() == 0 && RootbeerGpu.getBlockIdxx() == 0){
System.out.println("hello world");
}
}
Once you are done debugging, you can get a performance improvement by disabling exceptions and array bounds checks (see command line options).
List<GpuDevice> devices = rootbeer.getDevices();
GpuDevice device0 = devices.get(0);
GpuDevice device1 = devices.get(1);
Context context0 = device0.createContext(4212880);
Context context1 = device1.createContext(4212880);
context0.setCacheConfig(CacheConfig.PREFER_SHARED);
context1.setCacheConfig(CacheConfig.PREFER_SHARED);
context0.setThreadConfig(sizeBy2, outerCount, outerCount * sizeBy2);
context1.setThreadConfig(sizeBy2, outerCount, outerCount * sizeBy2);
context0.setKernel(new GPUSortKernel(array0));
context1.setKernel(new GPUSortKernel(array1));
context0.buildState();
context1.buildState();
//run using two gpus without blocking the current thread
GpuFuture future0 = context0.runAsync();
GpuFuture future1 = context1.runAsync();
future0.take();
future1.take();
public class RootbeerGpu (){
//returns true if on the gpu
public static boolean isOnGpu();
//returns blockIdx.x * blockDim.x + threadIdx.x
public static int getThreadId();
//returns threadIdx.x
public static int getThreadIdxx();
//returns blockIdx.x
public static int getBlockIdxx();
//returns blockDim.x
public static int getBlockDimx();
//returns gridDim.x;
public static long getGridDimx();
//__syncthreads
public static void syncthreads();
//__threadfence
public static void threadfence();
//__threadfence_block
public static void threadfenceBlock();
//__threadfence_system
public static void threadfenceSystem();
//given an object, returns the long handle
//in GPU memory
public static long getRef(Object obj);
//get/set byte in shared memory. requires 1 byte.
//index is byte offset into shared memory
public static byte getSharedByte(int index);
public static void setSharedByte(int index, byte value);
//get/set char in shared memory. requires 2 bytes.
//index is byte offset into shared memory
public static char getSharedChar(int index);
public static void setSharedChar(int index, char value);
//get/set boolean in shared memory. requires 1 byte.
//index is byte offset into shared memory
public static boolean getSharedBoolean(int index);
public static void setSharedBoolean(int index, boolean value);
//get/set short in shared memory. requires 2 bytes.
//index is byte offset into shared memory
public static short getSharedShort(int index);
public static void setSharedShort(int index, short value);
//get/set integer in shared memory. requires 4 bytes.
//index is byte offset into shared memory
public static int getSharedInteger(int index);
public static void setSharedInteger(int index, int value);
//get/set long in shared memory. requires 8 bytes.
//index is byte offset into shared memory
public static long getSharedLong(int index);
public static void setSharedLong(int index, long value);
//get/set float in shared memory. requires 4 bytes.
//index is byte offset into shared memory
public static float getSharedFloat(int index);
public static void setSharedFloat(int index, float value);
//get/set double in shared memory. requires 8 bytes.
//index is byte offset into shared memory
public static double getSharedDouble(int index);
public static void setSharedDouble(int index, double value);
//atomic add value to array at index
public static void atomicAddGlobal(int[] array, int index, int value);
public static void atomicAddGlobal(long[] array, int index, long value);
public static void atomicAddGlobal(float[] array, int index, float value);
//atomic sub value from array at index
public static void atomicSubGlobal(int[] array, int index, int value);
//atomic exch value at index in array. old is retured
public static int atomicExchGlobal(int[] array, int index, int value);
public static long atomicExchGlobal(long[] array, int index, long value);
public static float atomicExchGlobal(float[] array, int index, float value);
//from CUDA programming guide: "reads the 32-bit word old located at the
//address address in global memory, computes the minimum of old and val,
//and stores the result back to memory at the same address.
//These three operations are performed in one atomic transaction.
//The function returns old."
public static int atomicMinGlobal(int[] array, int index, int value);
//from CUDA programming guide: "reads the 32-bit word old located at the
//address address in global memory, computes the maximum of old and val,
//and stores the result back to memory at the same address.
//These three operations are performed in one atomic transaction.
//The function returns old."
public static int atomicMaxGlobal(int[] array, int index, int value);
//from CUDA programming guide: "reads the 32-bit word old located at the
//address address in global memory, computes (old == compare ? val : old),
//and stores the result back to memory at the same address.
//These three operations are performed in one atomic transaction. The function
//returns old (Compare And Swap)."
public static int atomicCASGlobal(int[] array, int index, int compare, int value);
//from CUDA programming guide: "reads the 32-bit word old located at the
//address address in global memory, computes (old & val), and stores the
//result back to memory at the same address.
//These three operations are performed in one atomic transaction.
//The function returns old."
public static int atomicAndGlobal(int[] array, int index, int value);
//from CUDA programming guide: "reads the 32-bit word old located at the
//address address in global memory, computes (old | val), and stores the
//result back to memory at the same address.
//These three operations are performed in one atomic transaction.
//The function returns old."
public static int atomicOrGlobal(int[] array, int index, int value);
//from CUDA programming guide: "reads the 32-bit word old located at the
//address address in global memory, computes (old ^ val), and stores the
//result back to memory at the same address.
//These three operations are performed in one atomic transaction.
//The function returns old."
public static int atomicXorGlobal(int[] array, int index, int value);
}
CUDA code is generated and placed in ~/.rootbeer/generated.cu
You can use this to find out the register / shared memory usage
$/usr/local/cuda/bin/nvcc --ptxas-options=-v -arch sm_20 ~/.rootbeer/generated.cu
You need to have the CUDA Toolkit and CUDA Driver installed to use Rootbeer. Download it from http://www.nvidia.com/content/cuda/cuda-downloads.html
Rootbeer is licensed under the MIT license. If you use rootbeer for any reason, please star the repository and email me your usage and comments. I am preparing my dissertation now.
See here for a variety of examples.
GPU Consulting available for Rootbeer and CUDA. Please email pcpratts@trifort.org
-
fatal error: bits/c++config.h: No such file or directory
when running./Rootbeer.jar -runeasytests
Install cross-compiling libraries:sudo apt-get install gcc-4.9-multilib g++-4.9-multilib
-
The following backtrace was made with commit 7777a351917216f03 as can be seen by using
findCommitFromTrace
findCommitFromTrace --short pfuscherei \ src/org/trifort/rootbeer/runtime/Serializer.java doReadFromHeap 155 \ src/org/trifort/rootbeer/runtime/CUDAContext.java readFromHeap 452 \ src/org/trifort/rootbeer/runtime/CUDAContext.java readBlocksList 332 \ src/org/trifort/rootbeer/runtime/CUDAContext.java onEvent 308
The first commit which changes
CUDAContext.java
thereafter is0f2a7ae4c6b0e9763f84
Add comments to understand multi-GPU problem
java.lang.ClassCastException: MonteCarloPiKernel cannot be cast to [J at MonteCarloPiKernel.org_trifort_readFromHeapRefFields_MonteCarloPiKernel0(Jasmin) * the backtrace becomes a bit occluded here, because doReadFromHeap * is written out using soot in VisitorReadGen.makeMethod * called by VisitorGen.generate * called by SerializerAdder.add * called by GenerateForKernel.makeClass at MonteCarloPiKernelSerializer.doReadFromHeap(Jasmin) at org.trifort.rootbeer.runtime.Serializer.readFromHeap(Serializer.java:155) at org.trifort.rootbeer.runtime.CUDAContext.readBlocksList(CUDAContext.java:452) * this is at `case NATIVE_RUN_LIST:` at org.trifort.rootbeer.runtime.CUDAContext$GpuEventHandler.onEvent(CUDAContext.java:332) at org.trifort.rootbeer.runtime.CUDAContext$GpuEventHandler.onEvent(CUDAContext.java:308) * Down below this are multithreading related. * I.e. CudaContext.GpuEventHandler.onEvent is called concurrently. * The setup is done using the lmax.com queue with: * m_disruptor = new Disruptor<GpuEvent>( GpuEvent.EVENT_FACTORY, 64, m_exec ); * m_handler = new GpuEventHandler(); * m_disruptor.handleEventsWith( m_handler ); * m_ringBuffer = m_disruptor.start(); at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) * The command executed i.e. NATIVE_RUN_LIST is set by * CudaContext.runAsync( List<Kernel> )
Profiling an everchanging bug:
nClassCast=0 nRuntime=0 nNullPointer=0 nNoError=0 for (( i=0; i<200; i++ )); do output=$(sparkSubmit "$HOME/MontePi.jar" 268435456 2 2 2>&1 | sed '/ INFO /d') echo "$output" if echo "$output" | grep -q 'java.lang.ClassCastException:'; then nClassCast=$((nClassCast + 1)) fi if echo "$output" | grep -q 'java.lang.RuntimeException'; then nRuntime=$((nRuntime + 1)) fi if echo "$output" | grep -q 'java.lang.NullPointerException'; then nNullPointer=$((nNullPointer + 1)) fi if ! echo "$output" | grep -q 'java.lang.[A-Za-z]*Exception'; then nNoError=$((nNoError + 1)) fi echo "| nClassCast = $nClassCast " echo "| nRuntime = $nRuntime " echo "| nNullPointer = $nNullPointer" echo "| nNoError = $nNoError " done
Output:
| nClassCast = 110 | nRuntime = 39 | nNullPointer = 5 | nNoError = 46
After making
Serializer.java
thread-safe:| nClassCast = 0 | nRuntime = 0 | nNullPointer = 0 | nNoError = 200
- Phil Pratt-Szeliga http://trifort.org/
- Maximilian Knespel
Starting with the main java-file the dependency structure can be viewed with include-spider
incvis -q -C src src/org/trifort/rootbeer/runtime/Rootbeer.java
Rootbeer.java
| * Provides the Rootbeer context, which in turn provides the user API,
| * and also GPU information and methods to start GPU calculations
+- BlockShaper.java
| * used by `run` to determine the best kernel configuration
| * for the given workload i.e. number of started "kernels"(threads)
+- Context.java
| | * Defines an abstract interface which can be implemented by e.g.
| | * CUDAContext.java. A more lowlevel internal API for starting
| | * CUDA kernels.
| +- CacheConfig.java
| | * Short Java version of cudaFuncCache enum, e.g. PREFER_SHARED
| +- GpuDevice.java
| | * A Java "struct" with getter/setter-boilerplate to hold GPU
| | * information. Comparable to cudaDeviceProp.
| | * Also contains API to create a new Context for that Device.
| +- CUDAContext.java
| | * Implements the Context interface.
| | * Handles the serialized class, exception and object memory.
| | * As well as sending the kernel to the GPU and timing the
| | * execution.
| | * Several methods are Java Native methods implemented by
| | * CUDARuntime.c
| +- ../configuration/Configuration.java
| | +- ../util/ResourceReader.java
| | +- ../configuration/RootbeerPaths.java
| +- util/Stopwatch.java
| +- ../runtimegpu/GpuException.java
| | +- Sentinal.java
| +- ../generate/bytecode/Constants.java
| +- BufferPrinter.java
| | +- Memory.java
| +- CheckedFixedMemory.java
| | +- FixedMemory.java
| +- CompiledKernel.java
| | +- Serializer.java
| +- GpuEvent.java
| | +- GpuEventCommand.java
| | +- GpuFuture.java
| | +- Kernel.java
| +- StatsRow.java
| +- ThreadConfig.java
+- CUDALoader.java
+- CUDARuntime.java
| +- IRuntime.java
+- OpenCLRuntime.java
incvis -q -C src src/org/trifort/rootbeer/entry/Main.java
generate/opencl/tweaks/GencodeOptions.java
| * Provides Compute Capability and Architecture enums.
| * Can generate a set of command line options to compile for
| * all architectures that nvcc version knows
src/org/trifort/rootbeer/util/CudaPath.java
| * Searches for the path of nvcc or nvcc.exe
src/org/trifort/rootbeer/util/CmdRunner.java
| * Wrapper to correctly get the output of a command line run
entry/Main.java
+- configuration/Configuration.java
| +- util/ResourceReader.java
| | +- configuration/RootbeerPaths.java
+- runtime/CUDALoader.java
| +- runtime/Rootbeer.java
| +- runtime/BlockShaper.java
| | +- runtime/GpuDevice.java
| | +- runtime/Context.java
| | +- runtime/CacheConfig.java
| | +- runtime/GpuFuture.java
| | | +- runtimegpu/GpuException.java
| | | +- runtime/Sentinal.java
| | +- runtime/Kernel.java
| | +- runtime/StatsRow.java
| | +- runtime/CUDAContext.java
| | +- runtime/util/Stopwatch.java
| | +- com/lmax/disruptor/EventHandler.java -> not found!
| | +- com/lmax/disruptor/RingBuffer.java -> not found!
| | +- com/lmax/disruptor/dsl/Disruptor.java -> not found!
| | +- generate/bytecode/Constants.java
| | +- runtime/BufferPrinter.java
| | | +- runtime/Memory.java
| | +- runtime/CheckedFixedMemory.java
| | | +- runtime/FixedMemory.java
| | | +- org/omg/CORBA/_IDLTypeStub.java -> not found!
| | +- runtime/CompiledKernel.java
| | | +- runtime/Serializer.java
| | +- runtime/GpuEvent.java
| | | +- com/lmax/disruptor/EventFactory.java -> not found!
| | | +- runtime/GpuEventCommand.java
| | +- runtime/ThreadConfig.java
| +- runtime/CUDARuntime.java
| | +- runtime/IRuntime.java
| +- runtime/OpenCLRuntime.java
+- entry/RootbeerCompiler.java
| compiler/*.java -> not found!
| generate/opencl/tweaks/CudaTweaks.java
| compressor/Compressor.java
| org/antlr/runtime/ANTLRStringStream.java -> not found!
| org/antlr/runtime/CommonTokenStream.java -> not found!
| org/antlr/runtime/NoViableAltException.java -> not found!
| org/antlr/runtime/RecognitionException.java -> not found!
| org/antlr/runtime/Token.java -> not found!
| compressor/OpenCLLexer.java
| org/antlr/runtime/*.java -> not found!
| compressor/OpenCLParser.java
| org/antlr/runtime/*.java -> not found!
| org/antlr/runtime/tree/*.java -> not found!
| deadmethods/DeadMethods.java
| util/ReadFile.java
| deadmethods/Block.java
| deadmethods/BlockParser.java
| deadmethods/Segment.java
| deadmethods/SegmentParser.java
| deadmethods/Method.java
| deadmethods/LiveMethodDetector.java
| deadmethods/MethodAnnotator.java
| deadmethods/MethodNameCompressor.java
| deadmethods/MethodNameParser.java
| util/CompilerRunner.java
| util/CudaPath.java
| util/WindowsCompile.java
| generate/opencl/tweaks/CompileResult.java
| util/CmdRunner.java
| generate/opencl/tweaks/ParallelCompile.java
| runtime/BlockingQueue.java
| generate/opencl/tweaks/ParallelCompileJob.java
| generate/opencl/tweaks/Tweaks.java
| generate/opencl/tweaks/NativeCpuTweaks.java
| util/*.java -> not found!
| pack/Pack.java -> not found!
| soot/*.java -> not found!
| soot/options/Options.java -> not found!
| soot/rbclassload/DfsInfo.java -> not found!
| soot/rbclassload/ListClassTester.java -> not found!
| soot/rbclassload/ListMethodTester.java -> not found!
| soot/rbclassload/MethodTester.java -> not found!
| soot/rbclassload/RootbeerClassLoader.java -> not found!
| soot/util/JasminOutputStream.java -> not found!
| CompilerSetup.java -> not found!
| ForcedFields.java -> not found!
| KernelEntryPointDetector.java -> not found!
| MainTester.java -> not found!
| RootbeerDfs.java -> not found!
| TestCaseEntryPointDetector.java -> not found!
| TestCaseFollowTester.java -> not found!
+- entry/RootbeerTest.java
| test/RootbeerTestAgent.java
| runtime/RootbeerGpu.java
| util/ForceGC.java
| test/ApplicationMain.java
| test/TestApplication.java
| test/TestApplicationFactory.java
| test/ChangeThread.java
| testcases/rootbeertest/gpurequired/ChangeThreadTest.java
| test/TestSerialization.java
| testcases/rootbeertest/gpurequired/ChangeThreadRunOnGpu.java
| test/TestSerializationFactory.java
| test/ExMain.java
| test/TestException.java
| test/TestExceptionFactory.java
| testcases/rootbeertest/exception/NullPointer1Test.java
| testcases/rootbeertest/exception/NullPointer1RunOnGpu.java
| testcases/rootbeertest/exception/NullPointer2Test.java
| testcases/rootbeertest/exception/NullPointer2RunOnGpu.java
| testcases/rootbeertest/exception/NullPointer2Object.java
| testcases/rootbeertest/gpurequired/ExceptionBasicTest.java
| testcases/rootbeertest/gpurequired/ExceptionBasicRunOnGpu.java
| testcases/rootbeertest/gpurequired/ExceptionTestException.java
| test/KernelTemplateMain.java
| testcases/rootbeertest/kerneltemplate/DoubleToStringKernelTemplateBuilderTest.java
| test/TestKernelTemplate.java
| testcases/rootbeertest/kerneltemplate/DoubleToStringKernelTemplateBuilderRunOnGpu.java
| testcases/rootbeertest/kerneltemplate/DoubleToStringKernelTemplateTest.java
| testcases/rootbeertest/kerneltemplate/DoubleToStringKernelTemplateRunOnGpu.java
| testcases/rootbeertest/kerneltemplate/FastMatrixTest.java
| testcases/rootbeertest/kerneltemplate/MatrixKernel.java
| testcases/rootbeertest/kerneltemplate/GpuParametersTest.java
| testcases/rootbeertest/kerneltemplate/GpuParametersRunOnGpu.java
| testcases/rootbeertest/kerneltemplate/GpuVectorMapTest2.java
| testcases/rootbeertest/kerneltemplate/GpuVectorMap2.java
| testcases/rootbeertest/kerneltemplate/GpuLongVectorPair.java
| testcases/rootbeertest/kerneltemplate/GpuVectorMapRunOnGpu2.java
| test/TestKernelTemplateFactory.java
| test/LoadTestSerialization.java
| test/Main.java
| testcases/rootbeertest/SuperClass.java
| testcases/rootbeertest/SuperClassRunOnGpu.java
| testcases/otherpackage/CompositeClass6.java
| testcases/otherpackage/CompositeClass5.java
| testcases/rootbeertest/CompositeClass4.java
| testcases/rootbeertest/CompositeClass3.java
| testcases/rootbeertest/CompositeClass2.java
| testcases/rootbeertest/CompositeClass1.java
| testcases/rootbeertest/CompositeClass0.java
| testcases/rootbeertest/arraysum/ArraySumTest.java
| testcases/rootbeertest/arraysum/ArraySum.java
| testcases/rootbeertest/canonical/CanonicalTest.java
| testcases/rootbeertest/canonical/CanonicalKernel.java
| testcases/rootbeertest/canonical2/CanonicalObject.java
| testcases/rootbeertest/canonical2/CanonicalArrays.java
| testcases/rootbeertest/exception/NullPointer4Test.java
| testcases/rootbeertest/exception/NullPointer4RunOnGpu.java
| testcases/rootbeertest/exception/NullPointer4Object.java
| testcases/rootbeertest/gpurequired/*.java -> not found!
| testcases/rootbeertest/remaptest/RemapTest.java
| testcases/rootbeertest/remaptest/RemapRunOnGpu.java
| testcases/rootbeertest/remaptest/CallsPrivateMethod.java
| testcases/rootbeertest/serialization/*.java -> not found!
| TestSerialization.java -> not found!
| TestSerializationFactory.java -> not found!
| util/CurrJarName.java
| soot/G.java -> not found!
| soot/Modifier.java -> not found!
| entry/JarClassLoader.java
- entry/Main.main
- entry/Main.parseArgs
- entry/RootbeerCompiler.compile
- entry/RootbeerCompiler.setupSoot
- entry/RootbeerCompiler.compileForKernels
- compiler/Transform2.run
- generate/bytecode/GenerateForKernel.makeClass
- generate/bytecode/GenerateForKernel.makeGpuBody
- generate/opencl/OpenCLScene.getCudaCode
- generate/opencl/OpenCLScene.makeSourceCode
- generate/opencl/OpenCLScene.methodBodiesString
- generate/opencl/tweaks/CudaTweaks.compileProgram
- deadmethods/DeadMethods.parseString
- deadmethods/DeadMethods.getResult
- deadmethods/LiveMethodDetector.parse
- generate/opencl/tweaks/ParallelCompile.compile
- generate/opencl/tweaks/ParallelCompile.run
- generate/opencl/tweaks/ParallelCompileJob.compile
- generate/opencl/tweaks/ParallelCompileJob.getResult
- writeJimpleFile
- writeClassFile
- makeOutJar
- runtime/Rootbeer.Rootbeer
- runtime/Kernel.Kernel
- runtime/GpuDevice.createContext
- runtime/CUDAContext.CUDAContext
- runtime/CUDAContext.c:allocateNativeContext
- runtime/CUDAContext.setMemorySize
- runtime/CUDAContext.c:initializeDriver
- runtime/CUDAContext.setThreadConfig
- runtime/CUDAContext.setKernel
- runtime/CUDAContext.buildState
- runtime/CUDAContext.readCubinFile
- runtime/FixedMemory.FixedMemory
- runtime/CUDAContext.GpuEventHandler.onEvent:NATIVE_BUILD_STATE
- runtime/CUDAContext.c:nativeBuildState
- cuDeviceGet, cuCtxCreate, cuModuleLoadFatBinary
- runtime/FixedMemory.getAddress
- runtime/FixedMemory.getSize
- cuMemAlloc, cuParamSet*, cuFuncSetBlockShape
- runtime/CUDAContext.run
- runtime/CUDAContext.runAsync
- runtime/GpuEvent.setKernelList
- runtime/GpuEvent.setValue
- runtime/CUDAContext.GpuEventHandler.onEvent:NATIVE_RUN_LIST
- runtime/CUDAContext.writeBlocksList
- runtime/Serializer.doWriteStaticsToHeap Soot generated method, not yet fully understood See generate/bytecode/VisitorWriteGenStatic:makeMethod
- runtime/Serializer.doWriteToHeap See generate/bytecode/VisitorWriteGen:makeWriteToHeapMethod
- runtime/CUDAContext.runGpu
- runtime/CUDAContext.c:cudaRun
- cuModuleGetGlobal, cuMemcpyHtoD, cuLaunchGrid, cuMemcpyDtoH
- runtime/CUDAContext.readBlocksList
- runtime/Serializer.doReadStaticsFromHeap
- runtime/FixedMemory.readRef
- runtime/Serializer.readFromHeap
- runtime/CUDAContext.writeBlocksList
- runtime/GpuFuture.take
- runtime/CUDAContext.close
These libraries are distributed with this repo which bloats it's size and guarantees for at least one copyright problem.
See folder lib
.
-
org.apache.maven.*
Used by Rootbeer Maven Plugin -
antlr-3.1.3.jar
LinkANTLR is an exceptionally powerful and flexible tool for parsing formal languages.
org.antrl.*
Used by./src/org/trifort/rootbeer/compressor/*
-
pack.jar
Linkmerges jars to one fat jar (same as zipmerge) plus it removes the line
Class-Path:.*
fromMETA-INF/MANIFEST.MF
from the main jar (would be the last jar with the zipmerge command).pack.*
-
sootclasses-rbclassload.jar
Link fork from Soot. First commit is 8eb2f460593d0bc8fcd65f8f22e7c677daef2872 and the last merge to soot is commit 978a7fed3af16177be93d3b19be2a6e446023788 from 2013-05-03. It is unclear whether this jar was compiled from branchmaster
orfeature/rbclassload2
. The last soot release is soot-2.5.0 from 2012-11-15, but there seems to be still active development and nightly builds in 2016-07-22.org.soot.*
Commits addedgit log --no-merges --stat --author="pcpratts" --name-only --pretty=format:"" | sort -u
are mainlysoot.rbclassload.*
and the folderpaper
.
Dependencies by Soot (see build.xml in Soot repository) or try searching for imports:
-
asm-debug-all-5.0.3.jar
LinkASM is an all purpose Java bytecode manipulation and analysis framework. It can be used to modify existing classes or dynamically generate classes, directly in binary form.
org.objectweb.asm.*
-
AXMLPrinter2.jar
LinkPrints XML document from binary XML file
org.xmlpull.*
android.util.*
android.content.*
-
commons-io-2.4.jar
Linkorg.apache.commons.io.*
-
dexlib2-2.0.3-dev.jar
Linkorg.jf.dexlib2.*
-
disruptor-3.3.0.jar
Link Github Technical paperA High Performance Inter-Thread Messaging Library
com.lmax.disruptor.*
-
guava-18.0.jar
Linkcom.google.common.*
-
hamcrest-all-1.3.jar
LinkHamcrest is a framework for writing matcher objects allowing 'match' rules to be defined declaratively. This is a dependency of soot.
org.hamcrest.*
com.thoughtworks.qdox.*` -
jasminclasses-2.5.0.jar
Linkjas.*
jasmin.*
scm.*
-
polyglotclasses-1.3.5.jar
Linkpolyglot.*
ppg.*
java_cup.*
-
util-2.0.3-dev.jar
Linkds.tree.*
org.jf.util.*
Without this this error appears:Exception in thread "main" java.lang.NoClassDefFoundError: org/jf/util/ExceptionWithContext at soot.toDex.DexPrinter.<init>(DexPrinter.java:141) at soot.PackManager.<init>(PackManager.java:529) at soot.Singletons.soot_PackManager(Singletons.java:472) at soot.PackManager.v(PackManager.java:344) at soot.PhaseOptions.getPM(PhaseOptions.java:39) at soot.PhaseOptions.getPhaseOptions(PhaseOptions.java:49) at soot.coffi.CoffiMethodSource.getBody(CoffiMethodSource.java:49) at soot.SootMethod.getBodyFromMethodSource(SootMethod.java:91) at soot.SootMethod.retrieveActiveBody(SootMethod.java:324) at soot.rbclassload.RootbeerClassLoader.loadScene(RootbeerClassLoader.java:857) at soot.rbclassload.RootbeerClassLoader.loadNecessaryClasses(RootbeerClassLoader.java:320) at org.trifort.rootbeer.entry.RootbeerCompiler.setupSoot(RootbeerCompiler.java:215)
Note thate
DexPrinter.java
does not exist in soot-rb/master, only in branchfeature/rbclassload2
, but there line 141 is not inside the constructor.With a bash script
findCommitFromTrace
we can try to find the commit the original author used for compilingfindSootCommitFromTrace.log
suggests that it is one of these commits:2a6ddca fd94e35 9127ab1 0a275e3 90774eb 05f5183 084e95e 02a365b
PackManager.java
is actually only correct for2a6ddca
, andSingletons.java
doesn't seem to match any commits available in soot. So a custom soot build seems to have been used. The problem is, that almost none of the files in any commit in soot-rb matches. The jar therefore can not be reproduced, although it could be tried if it works to merge2a6ddca
into thefeature/rbclassload2
branch and then compile it.In neither branches of
soot-rb
does this yield any matches, but searching in the original soot develop branch yields:No match at commit 9fdd641 No match at commit 4e93aaa Found match at commit 53fb3b3: setActiveBody(this.getBodyFromMethodSource("jb")); Found match at commit b6305e0: setActiveBody(this.getBodyFromMethodSource("jb")); No match at commit 820678f No match at commit 8959d45 Found match at commit 42b4ef1: setActiveBody(this.getBodyFromMethodSource("jb")); Found match at commit dc8005e: setActiveBody(this.getBodyFromMethodSource("jb")); Found match at commit 23bb5d6: setActiveBody(this.getBodyFromMethodSource("jb")); No match at commit c7f5f88 No match at commit c41d165
Then doing this again with:
file=src/soot/coffi/CoffiMethodSource.java keyword=getPhaseOptions lineNumber=49
Some classes do have main functions for testing purposes. Start them e.g. with
java -classpath build/classes/ org.trifort.rootbeer.runtime.BlockShaper
Note that the second argument may not be different because of the package
keyword at the top of this file. Instead adjust the classpath if necessary!
At many places, especially in Serializer.java
, generate/opencl/*.c
, FixedMemory.java
and many others bitshifts were used.
After a while I came to the the understanding that those compressed offsets for the manually managed heap and in order to save space the least 4 bits were deleted and the resulting number then cast to int to save space.
There are so many problems with this approach. At first it only bugged me because of the complexity, because it could have easily been written like: "address /= nBytesAlignment" and then be clear and the alignment could be saved at one place (or maybe two for the C source files) also.
But there are also no checks if value might be too large to cast to int which happens for INT_MAX*16
i.e. after 32GB. A limitation not mentioned anywhere.
The bigger problem is how null pointers are treated. A null pointer is defined in Serializer.java
to return a relative address -1
. This negative value gets alos bitshifted back and forth, being problematic as that depends on how negative numbers are encoded and or how those bitshifts interact with those (in C++ shifting negative numbers is different from unsigned bitshifts).
Also bitshift left and right is not reversible as this simple experiment shows:
/*
javac refcompress.java && java refcompress
*/
class refcompress
{
private static void testCompressLong( long l )
{
long l2 = l >> 4;
int i = (int) l2;
long l3 = i;
long l4 = l3 << 4;
System.out.println( "l = " + l );
System.out.println( "l2 = " + l2 );
System.out.println( "i = " + i );
System.out.println( "l3 = " + l3 );
System.out.println( "l4 = " + l4 );
System.out.println();
}
public static void main( String[] args )
{
testCompressLong( -1 );
testCompressLong( 64 );
}
}
Output:
l = -1 l2 = -1 i = -1 l3 = -1 l4 = -16
l = 64 l2 = 4 i = 4 l3 = 4 l4 = 64
This is a problem e.g. in Serializer.java
All the output of that example suggests that -1
is encoded as 0xFFFF FFFF FFFF FFFF
for long and 0xFFFF FFFF
for int. I.e. counting backwards from LONG_MAX
. Resulting in 0xFFFF FFFF FFFF FFF0 = -16
after the bitshift back.
find rootbeer1 -name '*.java' -o -name '*.c' | xargs -I{} grep --color -H -n '[*/][^a-zA-Z0-9]*16' '{}'
find rootbeer1 -name '*.java' -o -name '*.c' | xargs -I{} grep --color -H -n '\(<<\|>>\)[^a-zA-Z0-9]*4' '{}'
Mixing bitshifting with /
and *
is only possible with positive numbers!
Singletons in Rootbeer 'grep' -r '\.v[ \t]*([ \t]*)' src/ | sed -r 's|.*[^0-9A-Za-z]([0-9A-Za-z]*)[ \t]*\.v[ \t]*\([ \t]*\).*|\1|' | sort | uniq | xargs -I{} find src/ -name '{}.java'
:
- generate/opencl/NameMangling.java
- generate/opencl/OpenCLScene.java
- generate/bytecode/RegisterNamer.java
- configuration/RootbeerPaths.java
- generate/opencl/tweaks/Tweaks.java
- Jimple
- NameMangling
- Options
- Printer
- RootbeerClassLoader
- StringNumbers
=> Many of the found singletons in the generate
-folder and therefore hopefully only used when compiling, except RootbeerPaths, but that shouldn't be critical or lead to the observered exception.
I don't know if all of the used libraries are thread-safe, e.g. com.lmac.disruptor
Watch out for static variables! 'grep' -rn 'static' csrc/ src/ > statics.log
and then clean output for static methods and finale static member variables.
Note that final static variables are still modifiable (except for primitive datatypes)! (This is unlike C++, or rather behavior like Object *
in C++. Therefore final static types need to be checked whether they are also [mut](http://www.javapractices.com/topic/TopicAction.do?Id=29][able](http://stackoverflow.com/questions/5886439/what-is-the-java-equivalent-of-cs-const-member-function). E.g. final Strings are OK, but not Lists!
The important classes are those in runtime
or more generically all needed for running the program. @todo sort out classes needed only for compiling and don't merge those in the final fat jar to save space!
All memory is managed by rootbeer in these variables:
__shared__ char m_shared[%%shared_mem_size%%];
struct m_Local {
/**
* Pointer to memory with objects aligned to 16 byte boundaries
*/
unsigned long long dpObjectMem, // 0
unsigned long long objectMemSizeDiv16, // 1
/**
* memory for reda-only members of classes
*/
unsigned long long dpClassMem // 2
}
src/org/trifort/rootbeer/
+- compiler
+- compressor
+- configuration
+- deadmethods
| * Classes needed to find methods in the generated C source code which
| * are unused.
| * 1. lines or smaller units of strings are categorized into
| * TYPE_COMMENT, TYPE_STRING, ... in so called Segments.
| * 2. segments are conjoined to so called Blocks. One block may be a
| * function body or function declaration or a preprocessor directive.
| * 3. Get the name of each method and look create a call graph,
| * filtering unused methods for performance reasons?
+- entry
+- generate
+- generate
| +- bytecode
| +- permissiongraph
| +- opencl
| +- body
| +- fields
| +- tweaks
+- remap
+- runtime
| +- nemu
| +- gpu
| +- util
+- runtimegpu
+- test
+- testcases
| +- everything
| +- otherpackage
| +- otherpackage2
| +- rootbeertest
| | +- arraysum
| | +- baseconversion
| | +- canonical
| | +- canonical2
| | +- exception
| | +- gpurequired
| | +- kerneltemplate
| | +- math
| | +- ofcoarse
| | +- remaptest
| | +- serialization
+- util
- add compiler warnings
- streamlining pack-rootbeer
- check which metrics do change on a changed kernel, e.g. when adding a local variable
- CompiledKernel backtrace verfolgen in Rootbeer und vlt. einfach so Fehler schon suchen oder Debug-Methoden einbauen