Attempts at fixing CI #144

peastman · 2024-05-16T22:19:04Z

We're getting a variety of errors on CI. I'm going to see if I can fix them.

The Linux builds fail with the compilation error

/usr/share/miniconda3/envs/build/include/python3.7m/Python.h:44:10: fatal error: crypt.h: No such file or directory

This is apparently due to a recent change to conda-forge. Hopefully it can be fixed by installing an extra package.

The Mac builds report two errors. The OpenCL tests fail with

ld: dynamic main executables must link with libSystem.dylib for architecture x86_64
error: linker command failed with exit code 1 (use -v to see invocation)
Final linking of kernel determineNativeAccuracy failed.

Apparently you're supposed to specify -lSystem when running ld. But in this case it isn't anything we have control over. It's being called internally by PoCL when it tries to compile kernels. We don't really want it using PoCL anyway, so I'll see if I can disable it.

There's also a test failure

1/5 Test #1: TestSerializeTorchForce ..........***Failed    0.44 sec
exception: open file failed, file path: tests/forces.pt

That's probably just a path issue.
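As an illustration of why a relative path such as tests/forces.pt can fail, here is a minimal Python sketch. It is not the project's test code: the helper `resolve_test_file` and the `build` directory layout are hypothetical. It only shows that a relative path is resolved against whatever directory the test happens to be launched from, while a path anchored to a known base directory is not.

```python
import os
import tempfile
from pathlib import Path


def resolve_test_file(relative_path: str, base_dir: Path) -> Path:
    """Resolve relative_path against an explicit base directory, not the cwd."""
    return (base_dir / relative_path).resolve()


def demo() -> tuple:
    """Return (found via cwd, found via explicit base) for tests/forces.pt."""
    with tempfile.TemporaryDirectory() as root:
        root_path = Path(root).resolve()
        # Hypothetical layout: the data file lives under <root>/tests,
        # but the test is launched from <root>/build.
        (root_path / "tests").mkdir()
        (root_path / "tests" / "forces.pt").write_bytes(b"")
        build_dir = root_path / "build"
        build_dir.mkdir()

        old_cwd = Path.cwd()
        try:
            os.chdir(build_dir)
            # Resolved against the cwd, so this looks for build/tests/forces.pt.
            found_via_cwd = Path("tests/forces.pt").exists()
            # Resolved against the known root, independent of the cwd.
            found_via_base = resolve_test_file("tests/forces.pt", root_path).exists()
        finally:
            os.chdir(old_cwd)
        return found_via_cwd, found_via_base
```

Either fixing the working directory the tests run in or anchoring the path to a known directory would avoid this kind of failure.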

Closes #126

peastman · 2024-05-16T22:48:05Z

The Linux build with CUDA 11.8 is failing with the error

2024-05-16T22:31:48.8649760Z ##[warning]You are running out of disk space. The runner will stop working when the machine runs out of disk space. Free space left: 87 MB
2024-05-16T22:31:48.9584994Z ##[warning]
InvalidArchiveError('Error with archive /home/runner/conda_pkgs_dir/gmpy2-2.1.5-py310hc7909c9_1.conda.  You probably need to delete and re-download or re-create this file.  Message was:\n\nfailed with error: [Errno 28] No space left on device')

Are we doing something to cause it to use an unreasonable amount of disk space? I don't see anything obvious.
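One way to investigate a question like this is to log the free disk space between CI steps. The following is a minimal sketch, assuming Python is available on the runner. The function names and any threshold passed to `has_room` are hypothetical and not part of the workflow.

```python
import shutil


def free_megabytes(path: str = ".") -> float:
    """Return the free space, in MB, on the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 * 1024)


def has_room(required_mb: float, path: str = ".") -> bool:
    """Return True if at least required_mb megabytes are free at path."""
    return free_megabytes(path) >= required_mb


if __name__ == "__main__":
    # Print the current free space so it shows up in the CI log.
    print(f"Free space: {free_megabytes():.0f} MB")
```

Calling this before and after each install step would show which step consumes the space. The warning above was raised with 87 MB left.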

RaulPPelaez · 2024-05-17T06:06:24Z

See here, I got the CUDA 11.8 one working:
#126
It needs a few lines to free up disk space on the runner. The Jimver action simply takes up too much disk space.

For CUDA 12 I am not able to convince pytorch to find CUDA sources.

peastman · 2024-05-17T15:43:04Z

Do we really need to use that action? Since all we're doing is building CUDA code, not running it, can we get by with just the CUDA conda packages?

peastman · 2024-05-17T16:05:31Z

For some reason the Python 3.12 build keeps installing Python 3.10 instead. I can't figure out where that's coming from.
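A problem like this can be made to fail early rather than surfacing later in the build. The following is a minimal sketch of a check that compares the running interpreter against the version the CI matrix asked for. The function name `check_python_version` is hypothetical, and how the expected version reaches the script is left open.

```python
import sys


def check_python_version(expected: str, actual=sys.version_info) -> bool:
    """Return True if actual matches the expected 'major.minor' version string."""
    major, minor = (int(part) for part in expected.split(".")[:2])
    return (actual[0], actual[1]) == (major, minor)


if __name__ == "__main__":
    # Hypothetical usage: pass the requested version as the first argument.
    wanted = sys.argv[1] if len(sys.argv) > 1 else "3.12"
    if not check_python_version(wanted):
        found = f"{sys.version_info[0]}.{sys.version_info[1]}"
        sys.exit(f"Expected Python {wanted}, found {found}")
```

This does not explain where the wrong version comes from. It only makes the mismatch visible at the step where the environment is created.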

peastman · 2024-05-17T19:51:47Z

Your approach worked for CUDA 11.8. But for 12 it runs out of disk space before there's a chance to clean up after it.

2024-05-17T19:45:13.8721155Z [command]/usr/bin/sudo /home/runner/work/openmm-torch/openmm-torch/cuda_installer-linux-6.5.0-1021-azure-12.4.1/cuda_installer-linux-6.5.0-1021-azure_12.4.1.run --silent --toolkit --override
2024-05-17T19:47:31.9703576Z terminate called after throwing an instance of 'boost::filesystem::filesystem_error'
2024-05-17T19:47:31.9705412Z   what():  boost::filesystem::copy_file: No space left on device: "./builds/nsight_compute/target/linux-desktop-glibc_2_11_3-x64/ncu", "/usr/local/cuda-12.4/nsight-compute-2024.1.1/target/linux-desktop-glibc_2_11_3-x64/ncu"

RaulPPelaez · 2024-05-21T07:27:37Z

The action is always giving us headaches, but before CUDA 12 it was the only sane way to get nvcc.
We should move to getting CUDA from conda-forge, but with CUDA < 12 there is no way to get everything (except cudatoolkit-dev, which runs out of disk space).
#126 does not run out of space for CUDA 12. But CMake fails to find CUDA correctly.

peastman · 2024-05-21T17:56:44Z

For the OpenMM repo we instead use this script to install CUDA. I'll try using it instead.

peastman · 2024-05-21T21:37:28Z

I'm running out of patience with this. I suggest we just run the tests we can and not worry about the rest. That means:

Don't try to build with CUDA 12. We have no way to install it.
Don't build OpenCL on Mac. We can't run the tests with Apple's OpenCL because there's no GPU, and PoCL is too buggy to run them.

peastman · 2024-05-24T01:39:44Z

Ok! I finally have all tests passing. That required cutting back on what tests we run, but I think that's the best we can do for the moment. Without having actual GPUs to test on, the CI won't provide really satisfactory testing no matter what we do. It will catch what it can catch, and we'll have to do additional testing by hand.

This is ready for a first review.

RaulPPelaez · 2024-05-24T06:52:18Z

.github/workflows/CI.yml

 os: ubuntu-22.04
 cuda-version: "11.8.0"
 gcc-version: "10.3.*"
 nvcc-version: "11.8"
 python-version: "3.10"
- pytorch-version: "2.0.*"
+ pytorch-version: "2.1.*"


There is pytorch 2.3 in conda-forge.

I just tried 2.3, but it fails to install.

Could not solve for environment specs
The following packages are incompatible
├─ __cuda is requested and can be installed;
├─ python 3.10** is installable with the potential options
│  ├─ python [3.10.0|3.10.10|...|3.10.9], which can be installed;
│  └─ python [3.10.0|3.10.1|...|3.10.9] would require
│     └─ python_abi 3.10.* *_cp310, which can be installed;
└─ pytorch-gpu 2.3** is not installable because there are no viable options
   ├─ pytorch-gpu 2.3.0 would require
   │  └─ pytorch 2.3.0 cuda118_py39hd44be3b_300, which requires
   │     ├─ python >=3.9,<3.10.0a0 , which conflicts with any installable versions previously reported;
   │     └─ python_abi 3.9.* *_cp39, which conflicts with any installable versions previously reported;
   └─ pytorch-gpu 2.3.0 would require
      └─ pytorch 2.3.0 cuda120_py38heb61fd4_300, which requires
         └─ cuda-version >=12.0,<13 , which requires
            └─ __cuda >=12 , which conflicts with any installable versions previously reported.

I tried 2.2, but it reports that version isn't available at all. I switched back to 2.1.

PyTorch 2.3 seems to require CUDA >= 12, but only for Python > 3.9? To be honest, I am not sure; I can never fully grasp these conda errors.
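One way to read the solver message is to treat each of the two pytorch 2.3.0 builds it names as a set of constraints and check them against the requested environment. The sketch below does only that. The `Build` class and the function names are hypothetical, only the constraints the message reports are modeled, and the `__cuda` major version of 11 is an assumption taken from the 11.8 matrix entry. It does not explain why the solver considered no other builds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Build:
    name: str
    python_minor: Optional[int] = None    # required Python 3.x minor, if reported
    min_cuda_major: Optional[int] = None  # required __cuda major, if reported


# The two pytorch 2.3.0 builds named in the solver message above.
BUILDS = [
    Build("cuda118_py39hd44be3b_300", python_minor=9),    # python >=3.9,<3.10.0a0
    Build("cuda120_py38heb61fd4_300", min_cuda_major=12),  # __cuda >=12
]


def conflicts(build: Build, python_minor: int, cuda_major: int) -> List[str]:
    """List the reasons a build cannot be installed in the given environment."""
    reasons = []
    if build.python_minor is not None and build.python_minor != python_minor:
        reasons.append(f"needs python 3.{build.python_minor}, have 3.{python_minor}")
    if build.min_cuda_major is not None and cuda_major < build.min_cuda_major:
        reasons.append(f"needs __cuda >={build.min_cuda_major}, have {cuda_major}")
    return reasons


def installable(builds: List[Build], python_minor: int, cuda_major: int) -> List[str]:
    """Return the names of the builds that have no conflicts."""
    return [b.name for b in builds if not conflicts(b, python_minor, cuda_major)]
```

Under these assumptions, `installable(BUILDS, python_minor=10, cuda_major=11)` returns an empty list: the first build is rejected on the Python version and the second on the `__cuda` version, which matches the two failing branches in the message.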

Going forward, I have proposed some suggestions that may help alleviate these issues (#146). They may also help with moving to CUDA 12 when that happens.

peastman added 2 commits May 16, 2024 15:11

Try to fix CI on Linux

8ee2690

Don't install POCL on Mac

3295508

peastman added 10 commits May 16, 2024 15:49

Don't install khronos-opencl-icd-loader

d999f82

Update OpenMM version

2e7e10b

Updated tested versions

dac8668

Fixed version number

e854539

Install correct packages for CUDA 12

8b73bd7

Debugging

6342238

Debugging

ae97283

Debugging

852bbba

Debugging

be7d276

Debugging

e829b39

Debugging

956711d

peastman added 5 commits May 17, 2024 10:42

Debugging

5ce95d5

Debugging

2ddca89

Debugging

2a7290f

Debugging

8d30496

Debugging

dd0b2cb

peastman added 2 commits May 21, 2024 10:48

Debugging

4137432

Debugging

7028d14

peastman added 3 commits May 21, 2024 11:00

Try different method of installing CUDA

e21862a

Try not installing CUDA packages from conda

d5c931f

Debugging

405bab7

peastman added 5 commits May 21, 2024 12:49

Debugging

767b7e1

Debugging

931bacf

Debugging

ff80350

Debugging

ed38732

Don't build OpenCL on Mac

edac428

peastman added 11 commits May 22, 2024 12:09

Don't try to run tests that can't run correctly

6f3a150

Update C++ version and minimum macOS version

41090bd

Merge branch 'master' into ci

2742f05

Debugging

28d7319

Debugging

4dedafd

Debugging

1c644b4

Debugging

6973feb

Debugging

69b9e35

Debugging

875a698

Fixed working directory for tests

c84b975

Fixes to testing

e4c5d20

peastman marked this pull request as ready for review May 24, 2024 01:39

RaulPPelaez reviewed May 24, 2024

peastman added 3 commits May 24, 2024 08:28

Try using PyTorch 2.3

3726621

Try using PyTorch 2.2

6ab463a

Switch back to PyTorch 2.1

1389ef4

peastman merged commit d447643 into openmm:master May 27, 2024
3 checks passed

peastman deleted the ci branch May 27, 2024 16:26

jakirkham mentioned this pull request Jun 8, 2024

Suggestions for CUDA 11 & 12 handling in Conda environment file #146

Open

RaulPPelaez mentioned this pull request Aug 28, 2024

Release v1.5 #154

Open
