Attempts at fixing CI #144

Merged
merged 42 commits into from
May 27, 2024

Conversation

peastman
Member

@peastman peastman commented May 16, 2024

We're getting a variety of errors on CI. I'm going to see if I can fix them.

The Linux builds fail with the compilation error

/usr/share/miniconda3/envs/build/include/python3.7m/Python.h:44:10: fatal error: crypt.h: No such file or directory

This is apparently due to a recent change to conda-forge. Hopefully it can be fixed by installing an extra package.

The Mac builds report two errors. The OpenCL tests fail with

ld: dynamic main executables must link with libSystem.dylib for architecture x86_64
error: linker command failed with exit code 1 (use -v to see invocation)
Final linking of kernel determineNativeAccuracy failed.

Apparently you're supposed to specify -lSystem when running ld. But in this case it isn't anything we have control over. It's being called internally by PoCL when it tries to compile kernels. We don't really want it using PoCL anyway, so I'll see if I can disable it.

There's also a test failure

1/5 Test #1: TestSerializeTorchForce ..........***Failed    0.44 sec
exception: open file failed, file path: tests/forces.pt

That's probably just a path issue.

Closes #126

@peastman
Member Author

The Linux build with CUDA 11.8 is failing with the error

2024-05-16T22:31:48.8649760Z ##[warning]You are running out of disk space. The runner will stop working when the machine runs out of disk space. Free space left: 87 MB
2024-05-16T22:31:48.9584994Z ##[warning]
InvalidArchiveError('Error with archive /home/runner/conda_pkgs_dir/gmpy2-2.1.5-py310hc7909c9_1.conda.  You probably need to delete and re-download or re-create this file.  Message was:\n\nfailed with error: [Errno 28] No space left on device')

Are we doing something to cause it to use an unreasonable amount of disk space? I don't see anything obvious.
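
A minimal sketch (not part of this PR) of how a workflow step could log the free disk space before a large install starts, so the log shows where the space went. The names `free_space_mb` and `has_room` and the 1024 MB threshold are hypothetical, chosen only for illustration; `shutil.disk_usage` is the standard-library call doing the work.

```python
import shutil


def free_space_mb(path="/"):
    """Return the free space, in MiB, on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free // (1024 * 1024)


def has_room(path="/", required_mb=1024):
    """Return True if at least `required_mb` MiB are free at `path`.

    The 1024 MiB default is an arbitrary illustrative threshold.
    """
    return free_space_mb(path) >= required_mb


if __name__ == "__main__":
    # Report only; a real CI step might choose to fail here instead.
    print(f"free space: {free_space_mb('.')} MiB, enough: {has_room('.')}")
```

Running something like this before and after each install step would show which step consumes the space, which is the question raised above.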

@RaulPPelaez
Contributor

See here, I got the CUDA 11.8 one working:
#126
It needs a few extra lines to free up disk space on the runner. The way this Jimver action works just takes too much disk space.

For CUDA 12 I am not able to get PyTorch to find the CUDA sources.

@peastman
Member Author

Do we really need to use that action? Since all we're doing is building CUDA code, not running it, can we get by with just the CUDA conda packages?

@peastman
Member Author

For some reason the Python 3.12 build keeps installing Python 3.10 instead. I can't figure out where that's coming from.
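
One way to catch this early is a guard that compares the running interpreter with the version the matrix asked for. This is only a sketch: `matches_requested` is a hypothetical helper, and passing the requested version as a string such as "3.12" is an assumption about how the workflow would call it.

```python
import sys


def matches_requested(requested, actual=None):
    """Return True if `actual` (major, minor) equals a version string like '3.12'.

    `actual` defaults to the interpreter running this script.
    """
    if actual is None:
        actual = sys.version_info[:2]
    major, minor = (int(part) for part in requested.split(".")[:2])
    return (major, minor) == tuple(actual)[:2]


if __name__ == "__main__":
    # Hypothetical usage: python check_version.py 3.12
    wanted = sys.argv[1] if len(sys.argv) > 1 else "3.12"
    running = ".".join(str(n) for n in sys.version_info[:2])
    print(f"requested {wanted}, running {running}, "
          f"match: {matches_requested(wanted)}")
```

Run right after the environment is created, a check like this would show whether the wrong Python comes from the solver or from a later step.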

@peastman
Member Author

Your approach worked for CUDA 11.8. But for 12 it runs out of disk space before there's a chance to clean up after it.

2024-05-17T19:45:13.8721155Z [command]/usr/bin/sudo /home/runner/work/openmm-torch/openmm-torch/cuda_installer-linux-6.5.0-1021-azure-12.4.1/cuda_installer-linux-6.5.0-1021-azure_12.4.1.run --silent --toolkit --override
2024-05-17T19:47:31.9703576Z terminate called after throwing an instance of 'boost::filesystem::filesystem_error'
2024-05-17T19:47:31.9705412Z   what():  boost::filesystem::copy_file: No space left on device: "./builds/nsight_compute/target/linux-desktop-glibc_2_11_3-x64/ncu", "/usr/local/cuda-12.4/nsight-compute-2024.1.1/target/linux-desktop-glibc_2_11_3-x64/ncu"

@RaulPPelaez
Contributor

The action is always giving us headaches, but before CUDA 12 it was the only sane way to get nvcc.
We should move to getting CUDA from conda-forge, but there is no way (except cudatoolkit-dev, which runs out of disk space) to get everything with CUDA<12.
#126 does not run out of space for CUDA 12, but CMake fails to find CUDA correctly.

@peastman
Member Author

For the OpenMM repo we instead use this script to install CUDA. I'll try using it instead.

@peastman
Member Author

I'm running out of patience with this. I suggest we just do the tests we can do and not worry about the rest. That means:

  • Don't try to build with CUDA 12. We have no way to install it.
  • Don't build OpenCL on Mac. We can't run the tests with Apple's OpenCL because there's no GPU, and PoCL is too buggy to run them.

@peastman
Member Author

Ok! I finally have all tests passing. That required cutting back on what tests we run, but I think that's the best we can do for the moment. Without having actual GPUs to test on, the CI won't provide really satisfactory testing no matter what we do. It will catch what it can catch, and we'll have to do additional testing by hand.

This is ready for a first review.

@peastman peastman marked this pull request as ready for review May 24, 2024 01:39
os: ubuntu-22.04
cuda-version: "11.8.0"
gcc-version: "10.3.*"
nvcc-version: "11.8"
python-version: "3.10"
- pytorch-version: "2.0.*"
+ pytorch-version: "2.1.*"
Contributor

There is pytorch 2.3 in conda-forge.

Member Author

I just tried 2.3, but it fails to install.

Could not solve for environment specs
The following packages are incompatible
├─ __cuda is requested and can be installed;
├─ python 3.10**  is installable with the potential options
│  ├─ python [3.10.0|3.10.10|...|3.10.9], which can be installed;
│  └─ python [3.10.0|3.10.1|...|3.10.9] would require
│     └─ python_abi 3.10.* *_cp310, which can be installed;
└─ pytorch-gpu 2.3**  is not installable because there are no viable options
   ├─ pytorch-gpu 2.3.0 would require
   │  └─ pytorch 2.3.0 cuda118_py39hd44be3b_300, which requires
   │     ├─ python >=3.9,<3.10.0a0 , which conflicts with any installable versions previously reported;
   │     └─ python_abi 3.9.* *_cp39, which conflicts with any installable versions previously reported;
   └─ pytorch-gpu 2.3.0 would require
      └─ pytorch 2.3.0 cuda120_py38heb61fd4_300, which requires
         └─ cuda-version >=12.0,<13 , which requires
            └─ __cuda >=12 , which conflicts with any installable versions previously reported.

Member Author

I tried 2.2, but it reports that version isn't available at all. I switched back to 2.1.

Contributor

@RaulPPelaez RaulPPelaez May 27, 2024

PyTorch 2.3 requires CUDA >=12 it seems, but only for Python >3.9? tbh I am not sure, I can never fully grasp these conda errors.
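
One way to read the solver output is to decode the build strings it prints. The sketch below assumes those strings follow the informal `cuda<ver>_py<ver>h<hash>_<n>` pattern seen in the two candidates quoted in this thread; that pattern is a packaging convention, not a guaranteed format, and `decode_build` and `mismatches` are hypothetical helpers. It only explains the two builds the solver listed and says nothing about whether other builds exist.

```python
import re

# Assumed pattern, inferred from the two build strings in the solver output.
BUILD_RE = re.compile(r"cuda(\d+)(\d)_py(\d)(\d+)h[0-9a-f]+_\d+")


def decode_build(build):
    """Extract the (major, minor) CUDA and Python versions from a build string."""
    match = BUILD_RE.fullmatch(build)
    if match is None:
        raise ValueError(f"unrecognized build string: {build!r}")
    cuda_major, cuda_minor, py_major, py_minor = (int(g) for g in match.groups())
    return {"cuda": (cuda_major, cuda_minor), "python": (py_major, py_minor)}


def mismatches(build, want_python, want_cuda_major):
    """List which of 'python' and 'cuda' differ from what the environment wants."""
    info = decode_build(build)
    problems = []
    if info["python"] != want_python:
        problems.append("python")
    if info["cuda"][0] != want_cuda_major:
        problems.append("cuda")
    return problems


if __name__ == "__main__":
    # The job in question asks for Python 3.10 and CUDA 11.8.
    for build in ("cuda118_py39hd44be3b_300", "cuda120_py38heb61fd4_300"):
        print(build, decode_build(build), mismatches(build, (3, 10), 11))
```

Read this way, the first candidate is a CUDA 11.8 build for Python 3.9, and the second is a CUDA 12.0 build for Python 3.8, so neither fits a Python 3.10 environment with CUDA 11.8.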

Going forward, I have proposed some suggestions that may help alleviate these issues (#146). They may also help with moving to CUDA 12 when that happens.

@peastman peastman merged commit d447643 into openmm:master May 27, 2024
3 checks passed
@peastman peastman deleted the ci branch May 27, 2024 16:26
@RaulPPelaez RaulPPelaez mentioned this pull request Aug 28, 2024