[core][observability] Logs are duplicated if multiple nodes are running on same machine #48676
base: master
python/ray/_private/node.py
Outdated
filename = ray_constants.LOG_MONITOR_LOG_FILE_NAME
file_path = os.path.join(self._logs_dir, filename)
This file won't exist if redirect_logging is False?
In which case will this happen?
Case 1: Do you mean setting RAY_LOG_TO_STDERR=1 for the head node and RAY_LOG_TO_STDERR=0 for the worker nodes? I am not sure what the expected behavior is in this case. Should we redirect the logs to the driver process or not?
RAY_LOG_TO_STDERR=1 ray start --head --include-dashboard=True
ray start --address=127.0.0.1:6379
# Try to schedule the task to the worker node
python3 example.py
Case 2:
ray start --head --include-dashboard=True
ray start --address=127.0.0.1:6379
RAY_LOG_TO_STDERR=1 python3 example.py
log_monitor.log will be created by ray start --head --include-dashboard=True.
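The concern raised in this thread can be illustrated with a minimal sketch. This is not the code in the PR: `log_monitor_log_exists` and `demo` are hypothetical helpers, and the file name `log_monitor.log` is taken from the comment above. The sketch assumes the check under review tests for the presence of that file in the shared log directory, which only works if the first node's log monitor actually wrote the file there.

```python
import os
import tempfile

# File name taken from the discussion above; everything else is hypothetical.
LOG_MONITOR_LOG_FILE_NAME = "log_monitor.log"


def log_monitor_log_exists(logs_dir):
    """Return True if a log monitor log file is already present in logs_dir."""
    return os.path.exists(os.path.join(logs_dir, LOG_MONITOR_LOG_FILE_NAME))


def demo():
    """Show the check before and after a first node creates the log file."""
    with tempfile.TemporaryDirectory() as logs_dir:
        before = log_monitor_log_exists(logs_dir)
        # Simulate a first node whose log monitor writes its log file.
        open(os.path.join(logs_dir, LOG_MONITOR_LOG_FILE_NAME), "w").close()
        after = log_monitor_log_exists(logs_dir)
    return before, after
```

If the first node never writes the file (the situation the reviewer asks about), `log_monitor_log_exists` stays False and a second node on the same host would still start its own log monitor.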
If we plan to handle this case, how about creating an empty file named is_head in the log directory? The worker node can check whether the file exists to decide whether to launch the log monitor process.
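A rough sketch of the marker-file idea proposed here, under stated assumptions: `mark_head_node` and `should_start_log_monitor` are hypothetical names, not Ray functions, and the sketch assumes the head node always runs a log monitor while a worker skips it only when the marker is present in its own log directory.

```python
from pathlib import Path
import tempfile

HEAD_MARKER_NAME = "is_head"  # marker name proposed in the comment above


def mark_head_node(logs_dir):
    """Create the empty marker file in the log directory (hypothetical helper)."""
    marker = Path(logs_dir) / HEAD_MARKER_NAME
    marker.touch(exist_ok=True)
    return marker


def should_start_log_monitor(logs_dir, is_head):
    """Decide whether this node should launch its own log monitor.

    Assumption: the head node always launches one. A worker launches one only
    if no head marker exists in its log directory, i.e. it does not share the
    directory with a head node on the same host.
    """
    if is_head:
        return True
    return not (Path(logs_dir) / HEAD_MARKER_NAME).exists()
```

With this scheme a worker on a separate host finds no marker and starts its own log monitor, while a worker sharing the head node's log directory skips it.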
Updated
Why are these changes needed?
Each Ray node has a LogMonitor, which watches the logs under /tmp/ray/$session_name/logs and publishes them to GCS so that the driver process can access them. However, when we launch two Ray nodes on a single host, both LogMonitors watch the same directory, /tmp/ray/$session_name/logs, which leads to duplicated logs.
Related issue number
Closes #48642
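The duplication described under "Why are these changes needed?" can be reproduced with a toy model. This is only an illustration under simplifying assumptions: `ToyLogMonitor` is a hypothetical stand-in and not Ray's LogMonitor, the "directory" is an in-memory dict, and a plain list stands in for publishing to GCS.

```python
from collections import Counter


class ToyLogMonitor:
    """Toy stand-in for a per-node log monitor (not Ray's LogMonitor)."""

    def __init__(self, logs, published):
        self._logs = logs            # shared "directory": file name -> lines
        self._published = published  # shared sink standing in for GCS
        self._offsets = {}           # this monitor's read position per file

    def poll(self):
        """Publish every line this monitor has not seen yet."""
        for name, lines in self._logs.items():
            start = self._offsets.get(name, 0)
            self._published.extend(lines[start:])
            self._offsets[name] = len(lines)


def simulate(num_monitors):
    """Count how often each log line is published with num_monitors monitors."""
    logs = {"worker.out": ["hello", "world"]}
    published = []
    monitors = [ToyLogMonitor(logs, published) for _ in range(num_monitors)]
    for monitor in monitors:
        monitor.poll()
    return Counter(published)
```

Because each monitor keeps its own read offsets, two monitors watching the same directory publish every line twice, while a single monitor publishes each line once.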
Checks
git commit -s) in this PR.
scripts/format.sh to lint the changes in this PR.
method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
Without this PR
With this PR