This repository has been archived by the owner on Mar 30, 2022. It is now read-only.

Support Ubuntu 20.04 #512

Open
garymm opened this issue Aug 7, 2020 · 15 comments

@garymm
Contributor

garymm commented Aug 7, 2020

Ubuntu 20.04 LTS was released on April 23, 2020. It would be nice to support this latest LTS version.

Here's what I needed to do to get version 0.11 working on Ubuntu 20.04:
sudo apt install libncurses5 libtinfo5

So maybe just adding that to the installation instructions for now would be a good start. Updating the code to support the newer libs would be another option.
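
Before installing anything, it can help to confirm which of the legacy libraries are actually missing. The following is a minimal sketch, not part of the official instructions: it assumes the two packages above provide the sonames `libncurses.so.5` and `libtinfo.so.5`, and the function name `missing_sonames` is made up for illustration.

```shell
# Sketch: report which legacy libraries are absent from an `ldconfig -p`
# style listing read on stdin.
# ASSUMPTION: these sonames correspond to the libncurses5 and libtinfo5 packages.
REQUIRED_SONAMES="libncurses.so.5 libtinfo.so.5"

missing_sonames() {
  listing="$(cat)"
  for soname in $REQUIRED_SONAMES; do
    # ldconfig prints "<soname> (<arch info>) => <path>", so match the
    # soname followed by a space as a fixed string.
    if ! printf '%s\n' "$listing" | grep -qF -- "$soname "; then
      printf '%s\n' "$soname"
    fi
  done
}

# Typical use on a real machine (not executed here):
#   ldconfig -p | missing_sonames
```

If the function prints nothing, both libraries are already visible to the dynamic loader and the `apt install` step is probably unnecessary.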

@garymm
Contributor Author

garymm commented Aug 12, 2020

It seems the python support also doesn't work on 20.04 because it's looking for libpython3.6m.so.1.0. 20.04 comes with python3.8.2 and there's no easy way to get python 3.6.

@marcrasi
Contributor

> It seems the python support also doesn't work on 20.04 because it's looking for libpython3.6m.so.1.0. 20.04 comes with python3.8.2 and there's no easy way to get python 3.6.

Can you tell me what specifically you did to encounter this problem, so that I can make sure that the ubuntu20.04 builds don't have this problem?

@garymm
Contributor Author

garymm commented Aug 12, 2020

Tried running swift-jupyter as described here.

When starting the kernel, I saw errors like:

[I 09:42:54.199 NotebookApp] Kernel started: 1a8e1196-b812-4582-9bf8-e42fe72ef654, name: swift
Traceback (most recent call last):
  File "/home/garymm/swift-tensorflow/usr/lib/python3/dist-packages/lldb/__init__.py", line 35, in <module>
    import _lldb
ModuleNotFoundError: No module named '_lldb'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/garymm/src/swift-jupyter/swift_kernel.py", line 19, in <module>
    import lldb
  File "/home/garymm/swift-tensorflow/usr/lib/python3/dist-packages/lldb/__init__.py", line 38, in <module>
    from . import _lldb
ImportError: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
[I 09:42:57.200 NotebookApp] KernelRestarter: restarting kernel (1/5), new random ports
Traceback (most recent call last):
  File "/home/garymm/swift-tensorflow/usr/lib/python3/dist-packages/lldb/__init__.py", line 35, in <module>
    import _lldb
ModuleNotFoundError: No module named '_lldb'
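
An `ImportError` of this kind usually means the dynamic loader cannot resolve a dependency of the native module. One way to list such unresolved dependencies is to filter `ldd` output, as in the sketch below. The helper name `unresolved_libs` is made up, and the `_lldb.so` path in the usage comment is an assumption about where the native module lives inside the toolchain.

```shell
# Sketch: print the names of shared libraries that `ldd` reports as
# "not found". Reads ldd output on stdin.
unresolved_libs() {
  # ldd prints unresolved entries as "<soname> => not found";
  # the first whitespace-separated field is the soname.
  awk '/=> not found/ { print $1 }'
}

# Typical use (ASSUMED path, adjust to your toolchain layout; not executed here):
#   ldd ~/swift-tensorflow/usr/lib/python3/dist-packages/lldb/_lldb.so | unresolved_libs
```

On a machine showing the error above, one would expect `libpython3.6m.so.1.0` to appear in the output.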

@garymm
Contributor Author

garymm commented Aug 22, 2020

I think the issue of python 3.6 vs 3.8 was a symptom of me trying to use a release that was built on Ubuntu 18.04 on 20.04.

I built the toolchain from source and got a build to succeed on 20.04 with CUDA 11.0 and CUDNN 8.0.2. The only real bug I had to fix is described here:
https://groups.google.com/a/tensorflow.org/g/swift/c/RUlBncvPRfE
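
A mismatch between the host release and the release a toolchain was built for can be caught before unpacking. The sketch below compares `VERSION_ID` from `/etc/os-release` with the Ubuntu version embedded in an archive name. It assumes archives follow the `...-ubuntu18.04.tar.gz` naming pattern, and both function names are hypothetical.

```shell
# Sketch: read VERSION_ID from an os-release style file
# (defaults to /etc/os-release; pass an absolute path to override).
host_ubuntu_version() {
  os_release_file="${1:-/etc/os-release}"
  # Source the file in a subshell so its variables do not leak into the caller.
  ( . "$os_release_file" && printf '%s\n' "$VERSION_ID" )
}

# Sketch: extract the Ubuntu version embedded in a toolchain archive name.
# ASSUMPTION: the name contains "ubuntu<major>.<minor>".
toolchain_ubuntu_version() {
  printf '%s\n' "$1" | sed -n 's/.*ubuntu\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Typical use (not executed here):
#   archive="swift-tensorflow-RELEASE-0.12-cuda11.0-cudnn8-ubuntu18.04.tar.gz"
#   if [ "$(host_ubuntu_version)" != "$(toolchain_ubuntu_version "$archive")" ]; then
#     echo "warning: toolchain was built for a different Ubuntu release" >&2
#   fi
```

A warning here does not prove the toolchain will fail, but it flags the situation that produced the libpython3.6m error.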

@marcrasi
Contributor

I made some progress: #535

I'm still waiting on https://gitlab.com/nvidia/container-images/cuda/-/issues/83 before I can add cuda toolchains for ubuntu 20.04.

@brettkoonce
Contributor

@marcrasi toolchains have been updated!

@marcrasi
Contributor

I tried to make a CUDA build for ubuntu20.04, but there is still a small blocker: The version of TF that we use (2.3) supports CUDA 11.0 but not CUDA 11.1, and nvidia publishes docker images for ubuntu20.04 CUDA 11.1 but not CUDA 11.0.

I'm not sure if TF 2.4 supports CUDA 11.1, but I'll try again once we upgrade to TF 2.4 (which we're trying to do soon)
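
For anyone checking which CUDA release a given image or machine actually ships, the version can be read from `nvcc --version`. This is a rough sketch: the function names are made up, and the expected version passed to the comparison is whatever your TF build targets, not something this snippet knows.

```shell
# Sketch: extract the "<major>.<minor>" CUDA release from `nvcc --version`
# output read on stdin (the line of the form "... release 11.0, V11.0.194").
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Sketch: succeed only if the detected release equals the expected one.
# $1 = expected release (e.g. 11.0), $2 = detected release.
cuda_release_matches() {
  [ "$1" = "$2" ]
}

# Typical use (not executed here):
#   detected="$(nvcc --version | cuda_release)"
#   cuda_release_matches "11.0" "$detected" || echo "unexpected CUDA release: $detected" >&2
```

An exact string comparison is deliberately strict, since the concern in this thread is that 11.0 and 11.1 are not interchangeable.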

@brettkoonce
Contributor

@marcrasi My understanding is that TF 2.4 is the first release that officially supports CUDA 11.0 (https://github.com/tensorflow/tensorflow/releases/tag/v2.4.0), so I'm not sure how you got 11.0 working in the first place (a pull from master?). CUDA 11.1 is the release that supports the new consumer Ampere cards (11.0 only covers the A100 series), so it would be nice to have that in particular (tensorflow/tensorflow#44750). 11.2 is already out as well!

@brettkoonce
Contributor

also, @texasmichelle

you might run this and look at the logs being spit out:

export GPU_TYPE="a100"
export ZONE="us-central1-a"

gcloud compute instances create s4tf-ubuntu-${GPU_TYPE} \
  --zone=${ZONE} \
  --image-project=deeplearning-platform-release \
  --image-family=swift-latest-gpu-ubuntu-1804 \
  --maintenance-policy=TERMINATE \
  --accelerator="type=nvidia-tesla-${GPU_TYPE},count=1" \
  --metadata="install-nvidia-driver=True" \
  --machine-type=a2-highgpu-1g \
  --boot-disk-size=256GB

@texasmichelle
Member

@brettkoonce Can you share what you're seeing? I'm getting a warning about disk size, but otherwise that command seems to be working. Are you running in a project that has quota?

@texasmichelle
Member

Or are you pointing this out as an example of a toolchain running with cuda 11 support?

@brettkoonce
Contributor

@texasmichelle I was seeing some weird errors when running swift-models (e.g. lenet-mnist), but in retrospect I think what's going on is that you packaged the CUDA 10.2 version with your deep learning build. After pulling the CUDA 11 build (e.g. swift-tensorflow-RELEASE-0.12-cuda11.0-cudnn8-ubuntu18.04.tar.gz), everything works fine. It might be worth considering moving to 11.0 going forward. Still seeing tensorflow/swift-models#704, fwiw.

@texasmichelle
Member

ah, I see what you mean. I also tried using --image-family=swift-latest-cu110-ubuntu-1804, which seems fine on the tensorflow-0.12 branch of swift-models. However, I can see that the 0.12 release hasn't made it into the images yet. There's currently a code freeze for the holidays, but I'll see if I can get a more precise date on the next release. I submitted the change a few weeks ago, so I believe the code is ready otherwise.

@texasmichelle
Member

@brettkoonce You can expect to see DLVMs with v0.12 right after the freeze, e.g. by Jan. 8.

I also verified that cuda 11.0 is included in the existing toolchain and will remain going forward.

@machineko

machineko commented Dec 29, 2020

1 week ago =>

Ubuntu20.04 x86_64 cudnn images have been pushed! Having an issue with arm64 and ppc64le builds though. Will close this once those are released.

So could we get an Ubuntu 20.04 toolchain precompiled with CUDA (preferably 11.1 for Ampere support :D [nvidia/cuda:11.1-cudnn8-devel-ubuntu20.04]), or do we still need to wait for CUDA 11.1 support in the TensorFlow master branch?
