dsync generates MPI error when I'm not the owner of the source path #550

Open
Aelmazaty opened this issue Jun 13, 2023 · 0 comments

Hello,

I've installed mpifileutils version 0.11.1 using spack.
I always get an MPI error when I am not the owner of the source file or directory, even though I have at least read permission.
The files are copied, but the error is still generated. This is a problem because when the command is submitted as an LSF or SLURM job, the job is marked as failed.
No errors are generated if I am the owner of the source.
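
For anyone trying to reproduce this, the following is a minimal sketch of a check that reports which of the two cases a source path falls into before dsync is run. The function name `classify_access` is made up for illustration and is not part of mpifileutils; it only uses the shell's built-in file tests (`-e`, `-O`, `-r`) and says nothing about why dsync exits with a non-zero status.

```shell
#!/bin/sh
# classify_access PATH
# Hypothetical helper (not part of mpifileutils): prints one of
#   missing             - the path does not exist
#   owner               - the path is owned by the effective user
#   readable-not-owner  - the path is readable but owned by someone else
#   unreadable          - the path exists but cannot be read
classify_access() {
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ -O "$1" ]; then
        echo "owner"
    elif [ -r "$1" ]; then
        echo "readable-not-owner"
    else
        echo "unreadable"
    fi
}
```

In the report above, /hps/scratch/sysinf/power_usage (owned by root, mode -rw-r--r--) would be expected to fall into the "readable-not-owner" case for my user, and /hps/scratch/sysinf/power_usage_aelmazaty into the "owner" case.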

Example:
[aelmazaty@codon-dm-06 lsf-hx-wp]# ls -l /hps/scratch/sysinf/power_usage
-rw-r--r-- 1 root root 17035 Sep 5 2022 /hps/scratch/sysinf/power_usage

[aelmazaty@codon-dm-06 lsf-hx-wp]# mpirun -np 4 dsync -v --progress 1 /hps/scratch/sysinf/power_usage /hps/scratch/sysinf/aelmazaty/
[2023-06-13T16:01:14] Walking source path
[2023-06-13T16:01:14] Walking /hps/scratch/sysinf/power_usage
[2023-06-13T16:01:14] Walked 1 items in 0.001 secs (882.196 items/sec) ...
[2023-06-13T16:01:14] Walked 1 items in 0.001 seconds (818.132 items/sec)
[2023-06-13T16:01:14] Walking destination path
[2023-06-13T16:01:14] Walking /hps/scratch/sysinf/aelmazaty
[2023-06-13T16:01:14] Walked 1 items in 0.002 secs (617.520 items/sec) ...
[2023-06-13T16:01:14] Walked 1 items in 0.002 seconds (606.374 items/sec)
[2023-06-13T16:01:14] Comparing file sizes and modification times of 1 items
[2023-06-13T16:01:14] Started : Jun-13-2023, 16:01:14
[2023-06-13T16:01:14] Completed : Jun-13-2023, 16:01:14
[2023-06-13T16:01:14] Seconds : 0.000
[2023-06-13T16:01:14] Items : 1
[2023-06-13T16:01:14] Item Rate : 1 items in 0.000158 seconds (6310.263012 items/sec)
[2023-06-13T16:01:14] Updating timestamps on newly copied files

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[11234,1],0]
Exit code: 1

The file is copied, but an error is generated.

When I try with a file I own:
[aelmazaty@codon-dm-06 lsf-hx-wp]# ls -l /hps/scratch/sysinf/power_usage_aelmazaty
-rw-r--r-- 1 aelmazaty systems 17035 Jun 13 13:54 /hps/scratch/sysinf/power_usage_aelmazaty
[aelmazaty@codon-dm-06 lsf-hx-wp]# mpirun -np 4 dsync -v --progress 1 /hps/scratch/sysinf/power_usage_aelmazaty /hps/scratch/sysinf/aelmazaty/
[2023-06-13T16:02:17] Walking source path
[2023-06-13T16:02:17] Walking /hps/scratch/sysinf/power_usage_aelmazaty
[2023-06-13T16:02:17] Walked 1 items in 0.001 secs (872.339 items/sec) ...
[2023-06-13T16:02:17] Walked 1 items in 0.001 seconds (804.228 items/sec)
[2023-06-13T16:02:17] Walking destination path
[2023-06-13T16:02:17] Walking /hps/scratch/sysinf/aelmazaty
[2023-06-13T16:02:17] Walked 1 items in 0.000 secs (2210.726 items/sec) ...
[2023-06-13T16:02:17] Walked 1 items in 0.000 seconds (2045.349 items/sec)
[2023-06-13T16:02:17] Comparing file sizes and modification times of 1 items
[2023-06-13T16:02:17] Started : Jun-13-2023, 16:02:17
[2023-06-13T16:02:17] Completed : Jun-13-2023, 16:02:17
[2023-06-13T16:02:17] Seconds : 0.000
[2023-06-13T16:02:17] Items : 1
[2023-06-13T16:02:17] Item Rate : 1 items in 0.000162 seconds (6177.720668 items/sec)
[2023-06-13T16:02:17] Deleting items from destination
[2023-06-13T16:02:17] Removing 1 items
[2023-06-13T16:02:17] Removed 1 items in 0.003 seconds (327.228 items/sec)
[2023-06-13T16:02:17] Copying items to destination
[2023-06-13T16:02:17] Copying to /hps/scratch/sysinf/aelmazaty
[2023-06-13T16:02:17] Items: 1
[2023-06-13T16:02:17] Directories: 0
[2023-06-13T16:02:17] Files: 1
[2023-06-13T16:02:17] Links: 0
[2023-06-13T16:02:17] Data: 16.636 KiB (16.636 KiB per file)
[2023-06-13T16:02:17] Creating 1 files.
[2023-06-13T16:02:17] Copying data.
[2023-06-13T16:02:17] Copy data: 16.636 KiB (17035 bytes)
[2023-06-13T16:02:17] Copy rate: 1.207 MiB/s (17035 bytes in 0.013 seconds)
[2023-06-13T16:02:17] Syncing data to disk.
[2023-06-13T16:02:17] Sync completed in 0.020 seconds.
[2023-06-13T16:02:17] Setting ownership, permissions, and timestamps.
[2023-06-13T16:02:17] Updated 1 items in 0.003 seconds (298.208 items/sec)
[2023-06-13T16:02:17] Syncing directory updates to disk.
[2023-06-13T16:02:17] Sync completed in 0.001 seconds.
[2023-06-13T16:02:17] Started: Jun-13-2023,16:02:17
[2023-06-13T16:02:17] Completed: Jun-13-2023,16:02:17
[2023-06-13T16:02:17] Seconds: 0.043
[2023-06-13T16:02:17] Items: 1
[2023-06-13T16:02:17] Directories: 0
[2023-06-13T16:02:17] Files: 1
[2023-06-13T16:02:17] Links: 0
[2023-06-13T16:02:17] Data: 16.636 KiB (17035 bytes)
[2023-06-13T16:02:17] Rate: 391.203 KiB/s (17035 bytes in 0.043 seconds)
[2023-06-13T16:02:17] Updating timestamps on newly copied files

It works normally without getting any errors.

I tried several Open MPI versions, all installed via spack; the latest is 4.1.5. I get the same error with all of them.

Is this a known issue? How can I avoid these errors?
Best regards,
Ahmed
