FIX: Update CUDA and improve status page with diagnostics #11

Merged
mmcky merged 20 commits into main from fix-cuda
Feb 5, 2024

Contributor

@mmcky mmcky commented Jan 31, 2024

fixes #10 #5


netlify bot commented Jan 31, 2024

Deploy Preview for timely-seahorse-68815c ready!

Name Link
🔨 Latest commit b790186
🔍 Latest deploy log https://app.netlify.com/sites/timely-seahorse-68815c/deploys/65c048bbbd1b61000847f480
😎 Deploy Preview https://deploy-preview-11--timely-seahorse-68815c.netlify.app


github-actions bot commented Jan 31, 2024

@github-actions github-actions bot temporarily deployed to pull request January 31, 2024 02:06 Inactive
Contributor Author

mmcky commented Jan 31, 2024

  • rerun with jax[_local] installed instead of jax[_pip] as per 9f96343

@github-actions github-actions bot temporarily deployed to pull request January 31, 2024 04:57 Inactive
Contributor Author

mmcky commented Feb 1, 2024

  • re-run with latest docker (switched numpyro and jax) -- it looks like numpyro is causing linking issues in jax.

@github-actions github-actions bot temporarily deployed to pull request February 1, 2024 07:47 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 1, 2024 23:53 Inactive
Contributor Author

mmcky commented Feb 1, 2024

@HumphreyYang this now installs torch and pyro-ppl in the Docker container first and then installs jax, to ensure appropriate CUDA linking for JAX. It is running with the GPU enabled, as reported in https://65bc2ecf929de00cbf2d4378--timely-seahorse-68815c.netlify.app/status

I am surprised, though, to see that the runtime is now 5000s, so maybe torch is not working as expected.

  • build a CUDA=12.1 Docker image with torch, pyro-ppl and jax to see if we can improve the runtime with proper driver linking. We need to install jax in the 12.1 context without the 12.3 drivers being downloaded as a whl.

Update: the above configuration still has an issue with pytorch and jax in the same conda environment. I have just come across jax-ml/jax#18032, so it is a known issue that is currently unresolved. I think what we have now, using CUDA=12.3, might be the best we can achieve without loading two separate conda environments (at the lecture level).
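
A quick way to confirm what each library sees inside a single environment is a small report like the one below. This is a minimal sketch, not the status page's actual diagnostics: the name `gpu_report` is hypothetical, and it relies only on `jax.default_backend()`, `torch.cuda.is_available()` and `torch.version.cuda`. A library that is missing or fails to load is reported instead of raising, so the sketch also runs on a CPU-only machine.

```python
import importlib.util


def gpu_report():
    """Describe what jax and torch can see in the current environment.

    Hypothetical helper for illustration; returns one status string per library.
    """
    report = {}

    # jax: default_backend() names the platform jax will use, e.g. "cpu" or "gpu".
    if importlib.util.find_spec("jax") is None:
        report["jax"] = "not installed"
    else:
        try:
            import jax

            report["jax"] = f"{jax.__version__}, backend={jax.default_backend()}"
        except Exception as exc:  # for example, CUDA libraries that fail to load
            report["jax"] = f"failed to initialise: {exc}"

    # torch: version.cuda is the CUDA version the wheel was built for (None on CPU builds).
    if importlib.util.find_spec("torch") is None:
        report["torch"] = "not installed"
    else:
        try:
            import torch

            report["torch"] = (
                f"{torch.__version__}, built for CUDA {torch.version.cuda}, "
                f"cuda_available={torch.cuda.is_available()}"
            )
        except Exception as exc:
            report["torch"] = f"failed to initialise: {exc}"

    return report


if __name__ == "__main__":
    for name, status in gpu_report().items():
        print(f"{name}: {status}")
```

If jax reports a `cpu` backend while torch reports `cuda_available=True` (or the reverse), that points to the kind of linking conflict discussed above rather than to a missing GPU.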

Contributor Author

mmcky commented Feb 2, 2024

  • test CUDA=11.8 as both work in that context

This is down from 7 incompatible dependencies between the two to 1:

torch 2.2.0+cu118 requires nvidia-cudnn-cu11==8.7.0.84; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cudnn-cu11 8.9.6.50 which is incompatible.
  • need to review where Docker is running out of memory for this test. It doesn't appear to be linked to the HDD allocation.
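
To track how the number of conflicts changes between images (for example, from 7 to 1), the conflict lines can be parsed from the install log. The sketch below assumes every conflict follows the same wording as the single line quoted above; other pip versions may phrase it differently, and the names `parse_conflict` and `count_conflicts` are hypothetical.

```python
import re

# Shape of the conflict line quoted in this thread (an assumption about pip's wording).
CONFLICT = re.compile(
    r"^(?P<package>\S+) (?P<package_version>\S+) requires (?P<requirement>.+?), "
    r"but you have (?P<installed>\S+) (?P<installed_version>\S+) "
    r"which is incompatible\.$"
)

EXAMPLE = (
    'torch 2.2.0+cu118 requires nvidia-cudnn-cu11==8.7.0.84; '
    'platform_system == "Linux" and platform_machine == "x86_64", '
    'but you have nvidia-cudnn-cu11 8.9.6.50 which is incompatible.'
)


def parse_conflict(line):
    """Return the fields of one conflict line as a dict, or None if it does not match."""
    match = CONFLICT.match(line.strip())
    return match.groupdict() if match else None


def count_conflicts(output):
    """Count the lines in a block of install output that look like conflicts."""
    return sum(1 for line in output.splitlines() if parse_conflict(line))


if __name__ == "__main__":
    conflict = parse_conflict(EXAMPLE)
    print(conflict["package"], "needs", conflict["requirement"])
    print("installed:", conflict["installed"], conflict["installed_version"])
```

Feeding the captured output of each Docker build through `count_conflicts` gives a single number to compare across the CUDA 11.8, 12.1 and 12.3 images.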

Contributor Author

mmcky commented Feb 5, 2024

@github-actions github-actions bot temporarily deployed to pull request February 5, 2024 02:10 Inactive
Contributor Author

mmcky commented Feb 5, 2024

@kp992 I will merge this PR to remove the CUDA failure message that is currently live on stats.quantecon.org. I will then open an issue to update the Docker image to use mmcky/quantecon-lecture-python:cuda-12.3.1-anaconda-2023-09-py311 once we have successfully removed pytorch and pyro.

@mmcky mmcky merged commit 359a1ac into main Feb 5, 2024
5 checks passed
@mmcky mmcky deleted the fix-cuda branch February 5, 2024 02:33
@github-actions github-actions bot temporarily deployed to pull request February 5, 2024 04:22 Inactive
Contributor

kp992 commented Feb 5, 2024

Sounds good to me. Thanks @mmcky

Development

Successfully merging this pull request may close these issues.

BUG: CUDA + JAX
2 participants