diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md
index 79461e616..e6c00a791 100644
--- a/docs/services/cs2/run.md
+++ b/docs/services/cs2/run.md
@@ -34,56 +34,53 @@ python run.py \
   --mount_dirs {paths to modelzoo and to data} \
   --python_paths {paths to modelzoo and other python code if used}
 ```
+
 See the 'Troubleshooting' section below for known issues.
 
 ## Creating an environment
 
 To run a job on the cluster, you must create a Python virtual environment (venv) and install the dependencies. The Cerebras documentation contains generic instructions to do this [Cerebras setup environment docs](https://docs.cerebras.net/en/latest/wsc/getting-started/setup-environment.html) however our host system is slightly different so we recommend the following:
 
-1. Create the venv
+### Create the venv
 
 ```bash
 python3.8 -m venv venv_cerebras_pt
 ```
 
-1. Install the dependencies
+### Install the dependencies
 
 ```bash
 source venv_cerebras_pt/bin/activate
 pip install --upgrade pip
-pip install cerebras_pytorch==2.0.2
+pip install cerebras_pytorch==2.1.1
 ```
 
-1. Validate the setup
+### Validate the setup
 
 ```bash
 source venv_cerebras_pt/bin/activate
 cerebras_install_check
 ```
 
-
-## Troubleshooting
-
-### "Failed to transfer X out of 1943 weight tensors with modelzoo"
-Sometimes jobs receive an error during the 'Transferring weights to server' like below:
-```
-2023-12-14 16:00:19,066 ERROR: Failed to transfer 5 out of 1943 weight tensors. Raising the first error encountered.
-2023-12-14 16:00:19,118 ERROR: Initiating shutdown sequence due to error: Attempting to materialize deferred tensor with key “state.optimizer.state.214.beta1_power” from file model_dir/cerebras_logs/device_data_jxsi5hub/initial_state.hdf5, but the file has since been modified. The loaded tensor value may be different from originally loaded tensor. Please refrain from modifying the file while the run is in progress.
-```
+### Modify venv files to remove clock sync check on EPCC system.
 
 Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround:
 
-1. From within your python venv, edit the /lib64/python3.8/site-packages/cerebras_pytorch/storage.py file
+### From within your python venv, edit the /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file
+
 ```bash
-vi /lib64/python3.8/site-packages/cerebras_pytorch/storage.py
-```
+vi /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py
+```
+
+### Navigate to line 530
 
-1. Navigate to line 672
 ```bash
-:672
+:530
 ```
+
 The section should look like this:
-```
+
+```python
 if modified_time > self._last_modified:
     raise RuntimeError(
         f"Attempting to materialize deferred tensor with key "
@@ -94,8 +91,9 @@ if modified_time > self._last_modified:
     )
 ```
 
-1. Comment out the whole section
-```
+### Comment out the whole section
+
+```python
 #if modified_time > self._last_modified:
 #    raise RuntimeError(
 #        f"Attempting to materialize deferred tensor with key "
@@ -106,6 +104,42 @@ if modified_time > self._last_modified:
 #    )
 ```
 
-1. Save the file
+### Navigate to line 774
+
+```bash
+:774
+```
+
+The section should look like this:
+
+```python
+    if stat.st_mtime_ns > self._stat.st_mtime_ns:
+        raise RuntimeError(
+            f"Attempting to {msg} deferred tensor with key "
+            f"\"{self._key}\" from file {self._filepath}, but the file has "
+            f"since been modified. The loaded tensor value may be "
+            f"different from originally loaded tensor. Please refrain "
+            f"from modifying the file while the run is in progress."
+        )
+```
+
+### Comment out the whole section
+
+```python
+    #if stat.st_mtime_ns > self._stat.st_mtime_ns:
+    #    raise RuntimeError(
+    #        f"Attempting to {msg} deferred tensor with key "
+    #        f"\"{self._key}\" from file {self._filepath}, but the file has "
+    #        f"since been modified. The loaded tensor value may be "
+    #        f"different from originally loaded tensor. Please refrain "
+    #        f"from modifying the file while the run is in progress."
+    #    )
+```
+
+### Save the file
+
+### Run jobs as per existing documentation.
+
+## Paths, PYTHONPATH and mount_dirs
 
-1. Re-run the job
+There can be some confusion over the correct use of the parameters supplied to the run.py script. There is a helpful explanation page from Cerebras which explains these parameters and how they should be used. [Python, paths and mount directories.](https://docs.cerebras.net/en/latest/wsc/getting-started/mount_dir.html?highlight=mount#python-paths-and-mount-directories)
diff --git a/docs/services/ultra2/run.md b/docs/services/ultra2/run.md
index 18c1b5f98..6374cdc67 100644
--- a/docs/services/ultra2/run.md
+++ b/docs/services/ultra2/run.md
@@ -68,18 +68,19 @@ Once you have done this, your SSH key will be added to your Ultra2 account.
 
 Remember, you will need to use both an SSH key and Time-based one-time password to log into Ultra2 so you will also need to [set up your TOTP](https://epcced.github.io/safe-docs/safe-for-users/#how-to-turn-on-mfa-on-your-machine-account) before you can log into Ultra2.
 
-!!! Note
+---
+!!! note "First Login"
     When you **first** log into Ultra2, you will be prompted to change your initial password. This is a three step process:
 
-    1. When promoted to enter your *password*: Enter the password which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine)
+    1. When promoted to enter your *password*: Enter the password which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine)
+    1. When prompted to enter your new password: type in a new password
+    1. When prompted to re-enter the new password: re-enter the new password
 
-    2. When prompted to enter your new password: type in a new password
+    Your password has now been changed
 
-    3. When prompted to re-enter the new password: re-enter the new password
-
-    Your password has now been changed
 
     You will **not** use your password when logging on to Ultra2 after the initial logon.
+---
 
 ### SSH Login