gfx1030 on ubuntu #185
Comments
If there is any OS that has the best support for gfx1030, I will use whatever OS that is. Thanks!
I got rocm sdk builder's vLLM working inside a distrobox/podman container of Fedora 40 on Arch (Garuda).
Fedora 41 worked with no problems (on a gfx1101). Not quite as bleeding edge as Arch, but still pretty up to date.
Hi, nice that you got it working. I am myself testing with Ubuntu 22.04, Ubuntu 24.04, Mageia 9 and Fedora 40, sometimes also with Linux Mint. I think I will jump from Fedora 40 to Fedora 41 or 42 when I start updating the rocm sdk builder from the 6.1.2 to a newer ROCm base version. (I will first integrate the xdna/npu accelerator stuff that is now working on my laptop into this version before I start working on that.) Some others have been testing with Manjaro and Arch Linux, but as those are rolling distros, it's very hard to guarantee they keep working in the future unless you do testing all the time. If you are able to run the benchmarks and submit results, that would be nice. I have myself a different variant of the RX 6800 and it would be nice to compare the results and put them also to git. New results are always saved to the new_results folder and by editing the files to be loaded in
From the few distros I've tried, an unsupported one, openSUSE Tumbleweed, was best. I didn't compile it on there, so there are probably some system links missing. openSUSE Tumbleweed is way more stable despite being more bleeding edge. Oh, and don't get me started about YaST, that is god's gift. Sure, I'll test; just bear in mind that on my i9 9900KF it takes 16-18 hours.
Could you possibly add CTranslate2 into the extras?
I can try to look into that over the weekend.
Great, I've built it in Docker, but it uses 60 GB of storage. I'm pretty bad at programming so I couldn't figure out how to build it outside of Docker. Here are some resources/builds for it: One issue is that these builds use Python 3.10 whilst the SDK builder uses 3.11. Thanks lamikr, you're doing god's work with rocm_sdk_builder. Regarding testing, I'll do any distro for benchmarking, but I'm not capable of building it on unsupported distros. These are the ones I've already successfully run the build on:
I think the next one will be Intel's Clear Linux, as it's the most optimised distro around. Any requests? Cheers
Thanks for the feedback @Bazza-63, and nice list of distros. If you have any chance to check whether one or some of these distros can be easily detected from each other in install_debs.sh, and then to collect some kind of list of the packages that need to be installed to be able to do the rocm_sdk_builder build, that would be very useful information.
@Bazza-63 I added now a first version of CTranslate2. It's based on Arlo Phoenix's fork from March 2024. If you can verify/test it, that would be nice. I have only tested it for now by building on Fedora 40.
./babs.sh -b binfo/extra/CTranslate2.binfo
And then at least this worked for me:
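(The exact snippet was not captured in this thread; the following is a minimal smoke test of the sort that could confirm a working CTranslate2 ROCm build. The model directory name "my_ct2_model" is a hypothetical placeholder, and in this fork the HIP devices are still addressed through the CUDA-named API.)

import ctranslate2

# In the ROCm fork, HIP devices are reported through the CUDA-named API.
print("GPU devices visible:", ctranslate2.get_cuda_device_count())

# "my_ct2_model" is a placeholder for a directory produced by one of the
# CTranslate2 converters (e.g. ct2-transformers-converter).
translator = ctranslate2.Translator("my_ct2_model", device="cuda")

# translate_batch expects pre-tokenized input; real input should be tokenized
# with the model's own tokenizer.
results = translator.translate_batch([["Hello", "world"]])
print(results[0].hypotheses[0])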
In the long run, to make the CTranslate2 project easier to keep up to date, it would be good to get some ROCm-specific patches into the upstream version. There should be an option to specify that one wants to do the ROCm build, and then the project should automatically use hipify-perl to convert the NVIDIA-specific .cu files to .hip files, which would then be used instead of the .cu files in the ROCm case. Probably some other modifications would also be needed, but in that way it would be easier to keep the ROCm version of the CTranslate2 project up to date.
@lamikr I'll install Fedora 41 asap and fingers crossed the full source folder I compressed works, else I'll be waiting 16-18 hours for a rebuild. (I don't understand; I believe the documents say a Ryzen 3800X can do it in 6 hours, and it has basically the same compute as my i9 9900KF.) I'm a distro hopper, so I'll be checking out some more just to see compatibility. Do you perhaps see a binary (I think that's the term) being built? I'd happily pay for a CPU farm for the most popular cards. I already did a 64-core, 112 GB RAM build for my RX 7800 XT. Are there any more benchmarks you would like completing other than the included ones?
I use myself the archive I have made from the src_projects directory when I need to do a new build on another machine, and that has worked well for me. But I would recommend cleaning the builddir that contains the files that have been built from the sources.
@hyprbased So gfx1031 works for you ok on Fedora 40? It's strange that you had an error on Ubuntu 24.04; that has worked for me on all the GPUs I have. Have you been able to test the build with gfx90c? If you have multiple GPUs, you can use environment variables like HIP_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES to select which GPUs to use. (Like export HIP_VISIBLE_DEVICES=1 before launching the python/pytorch app, for example.)
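(A sketch of the same selection done from inside Python, assuming the device ordering shown later in this thread, where device 0 is the RX 6800M and device 1 the gfx90c iGPU; the variable must be set before torch initializes HIP.)

import os

# Must happen before importing torch; otherwise HIP has already enumerated devices.
os.environ["HIP_VISIBLE_DEVICES"] = "0"  # keep only the RX 6800M (gfx1031)

import torch

print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # expected: AMD Radeon RX 6800M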
@Bazza-63 Have you been able to test this? |
@lamikr yes, CTranslate2 worked on the model I needed it for. Apologies I didn't tell you sooner. I haven't gotten around to finding the dependencies needed for other Linux distros. My SSD is on death's door so I need a new one, so as to not kill this one with probably terabytes of new writes. Was it me you asked to push (or is it pull? again, my coding ability/knowledge stops at simple Python guess-who CLI programs) benchmark results of my RX 7800 XT to this repo? I can see myself getting around to some of this by the end of this month. Apologies for leaving you hanging after you spent time on something I asked for.
@lamikr I should have read back on this conversation. I assume you'd like me to test building it on the distros where the other people are suffering from failed builds? In my experience these failures happen from a missing dependency; then, when you install that dependency, the build still fails and you have to do a clean install. I am unsure, but I think copying and pasting the dependencies in the docs folder actually installs some that were not in the dependency scripts. Although that was months ago. Edit: I must be illiterate; the other person built it fine, it just doesn't work on Ubuntu but does on Fedora, am I reading this right lol.
@Bazza-63 Thanks for confirming! And yes, it was me who asked if you could run the benchmarks. If some of the packages fail for a missing build dependency, there should be no need to start building everything again from scratch. After installing the missing dependencies, it should be enough to clean just the failed project and start its build again from scratch. So, for example, deleting all build files for nvtop to make sure that its current source version is built again from scratch would be done with commands:
@lamikr I know it should be the case that it'll build again, but for some reason, after deleting the whole rocm_sdk_builder folder, it would still fail at the exact same point after trying again. It has happened twice, though not recently. Edit: do the benchmarks make use of the RDNA 3 dual-issue schedulers? I mean, I guess we'll find out soon if you're unsure. The RX 7800 XT has ~70 TFLOPS fp16 compared to the RX 6800 XT's ~40 TFLOPS if they are utilised. Edit edit edit: compiling it now, should have results in a few days. Apologies if each edit sends you an email.
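(A sketch for the dual-issue question above, not part of the included benchmark suite; the matrix size and iteration count are arbitrary. Timing a large fp16 matmul and converting it to achieved TFLOPS shows whether throughput lands near the dual-issue peak or the single-issue one.)

import time
import torch

n = 4096
a = torch.randn(n, n, dtype=torch.float16, device="cuda")
b = torch.randn(n, n, dtype=torch.float16, device="cuda")

# Warm up so kernel selection/compilation does not pollute the timing.
for _ in range(3):
    a @ b
torch.cuda.synchronize()

iters = 50
t0 = time.time()
for _ in range(iters):
    c = a @ b
torch.cuda.synchronize()
dt = (time.time() - t0) / iters

flops = 2 * n ** 3  # one multiply and one add per inner-product element
print(f"achieved fp16 throughput: {flops / dt / 1e12:.2f} TFLOPS")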
5900HX, 6800M (gfx1030)
Installing on Garuda (an Arch-based distro) didn't work, so I tried Ubuntu, thinking maybe that's more supported.
I made a fresh, full Ubuntu 24.04 GNOME desktop install for running rocm_sdk_builder.
final output:
/home/jordan/rocm_sdk_builder/builddir/042_python_apps_extra
post-install ok: python_apps_extra
ROCM SDK build and install ready
You can use following commands to check your GPU
rocminfo
jordan@expensivelaptop:~$ rocminfo
ROCk module is loaded
HSA System Attributes
Runtime Version: 1.1
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
Agent 1
Name: AMD Ryzen 9 5900HX with Radeon Graphics
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 5900HX with Radeon Graphics
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4890
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32247296(0x1ec0e00) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32247296(0x1ec0e00) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32247296(0x1ec0e00) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: gfx1031
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 6800M
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 3072(0xc00) KB
L3: 98304(0x18000) KB
Chip ID: 29663(0x73df)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2500
BDFID: 768
Internal Node ID: 1
Compute Unit: 40
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 116
SDMA engine uCode:: 80
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 12566528(0xbfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 12566528(0xbfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1031
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Agent 3
Name: gfx90c
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 1024(0x400) KB
Chip ID: 5688(0x1638)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 2048
Internal Node ID: 2
Compute Unit: 8
SIMDs per CU: 4
Shader Engines: 1
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 471
SDMA engine uCode:: 40
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16123648(0xf60700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 16123648(0xf60700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90c:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Jupyter notebook output:
import torch
print("pytorch version: " + torch.version)
pytorch version: 2.4.1
print("ROCM HIP version: " + torch.version.hip)
ROCM HIP version: 6.1.40093-f61212abc
X_train = torch.FloatTensor([0., 1., 2.])
print("cuda device count: " + str(torch.cuda.device_count()))
cuda device count: 2
print("cuda device name: " + torch.cuda.get_device_name(0))
cuda device name: AMD Radeon RX 6800M
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("device type: " + str(device))
device type: cuda
X_train = X_train.to(device)
print("Tensor training running on cuda: " + str(X_train.is_cuda))
Tensor training running on cuda: True
print("running simple model training test")
print(str(X_train))
running simple model training test
RuntimeError Traceback (most recent call last)
Cell In[7], line 2
1 print("running simple model training test")
----> 2 print(str(X_train))
File /opt/rocm_sdk_612/lib/python3.11/site-packages/torch/_tensor.py:463, in Tensor.__repr__(self, tensor_contents)
459 return handle_torch_function(
460 Tensor.__repr__, (self,), self, tensor_contents=tensor_contents
461 )
462 # All strings are unicode in Python 3.
--> 463 return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File /opt/rocm_sdk_612/lib/python3.11/site-packages/torch/_tensor_str.py:698, in _str(self, tensor_contents)
696 with torch.no_grad(), torch.utils._python_dispatch._disable_current_modes():
697 guard = torch._C._DisableFuncTorch()
--> 698 return _str_intern(self, tensor_contents=tensor_contents)
File /opt/rocm_sdk_612/lib/python3.11/site-packages/torch/_tensor_str.py:618, in _str_intern(inp, tensor_contents)
616 tensor_str = _tensor_str(self.to_dense(), indent)
617 else:
--> 618 tensor_str = _tensor_str(self, indent)
620 if self.layout != torch.strided:
621 suffixes.append("layout=" + str(self.layout))
File /opt/rocm_sdk_612/lib/python3.11/site-packages/torch/_tensor_str.py:350, in _tensor_str(self, indent)
346 return _tensor_str_with_formatter(
347 self, indent, summarize, real_formatter, imag_formatter
348 )
349 else:
--> 350 formatter = _Formatter(get_summarized_data(self) if summarize else self)
351 return _tensor_str_with_formatter(self, indent, summarize, formatter)
File /opt/rocm_sdk_612/lib/python3.11/site-packages/torch/_tensor_str.py:139, in _Formatter.__init__(self, tensor)
135 self.max_width = max(self.max_width, len(value_str))
137 else:
138 nonzero_finite_vals = torch.masked_select(
--> 139 tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
140 )
142 if nonzero_finite_vals.numel() == 0:
143 # no valid number, do nothing
144 return
RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.

./pytorch_cpu_vs_gpu_simple_benchmark.sh
jordan@expensivelaptop:/opt/rocm_sdk_612/docs/examples/pytorch$ ./pytorch_cpu_vs_gpu_simple_benchmark.sh
Benchmarking CPU and GPUs
Pytorch version: 2.4.1
ROCM HIP version: 6.1.40093-f61212abc
Device: AMD Ryzen 9 5900HX with Radeon Graphics
CPU time: 46.564 sec
Device: AMD Radeon RX 6800M
Traceback (most recent call last):
File "/opt/rocm_sdk_612/docs/examples/pytorch/pytorch_cpu_vs_gpu_simple_benchmark.py", line 61, in
a = torch.ones(mat_sz_x, mat_sz_y, device=torch_device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
jordan@expensivelaptop:/opt/rocm_sdk_612/docs/examples/pytorch$ export AMD_SERIALIZE_KERNEL=3
jordan@expensivelaptop:/opt/rocm_sdk_612/docs/examples/pytorch$ ./pytorch_cpu_vs_gpu_simple_benchmark.sh
Benchmarking CPU and GPUs
Pytorch version: 2.4.1
ROCM HIP version: 6.1.40093-f61212abc
Device: AMD Ryzen 9 5900HX with Radeon Graphics
CPU time: 51.969 sec
Device: AMD Radeon RX 6800M
Traceback (most recent call last):
File "/opt/rocm_sdk_612/docs/examples/pytorch/pytorch_cpu_vs_gpu_simple_benchmark.py", line 61, in
a = torch.ones(mat_sz_x, mat_sz_y, device=torch_device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: invalid device function
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
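For anyone hitting the same "HIP error: invalid device function": it usually means the PyTorch build contains no kernels for the running GPU's ISA. A quick check (a sketch using standard PyTorch-on-ROCm calls, not a command from this thread) is to compare the architecture list baked into the build against the arch the device reports; if the device arch (gfx1031 in the rocminfo output above) is missing from the compiled list, that mismatch is the likely cause:

import torch

# gfx targets this PyTorch build was compiled for.
print("compiled targets:", torch.cuda.get_arch_list())

# Arch reported by the device itself (gcnArchName is ROCm-specific).
print("device arch:", torch.cuda.get_device_properties(0).gcnArchName)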