[Journald] Restart journalctl if it exits unexpectedly #40558
Conversation
This pull request is now in conflicts. Could you fix it? 🙏
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
```go
logger.Infof("journalctl started with PID %d", cmd.Process.Pid)

go func() {
	if err := cmd.Wait(); err != nil {
```
I'm wondering if we should also watch for the `dataChan` and `errChan` to close. For example, if we failed to read stdout for some reason, that goroutine loop would exit.
This goroutine is here only to log any error returned by `cmd.Wait`; if the process exits unexpectedly, its stderr and stdout will be closed and the reader goroutines will get an EOF or an error. At the moment I'm doing my best to read and ship/log all data without over-complicating the code and risking a deadlock.
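For readers following along, here is a stand-alone sketch of the pattern under discussion, not the PR's actual code: one goroutine whose only job is to log the error returned by `cmd.Wait`, while the stdout reader goroutine stops on its own once the process dies and its pipe is closed. The `dataChan` name is borrowed from the comments above; the flags and logging are assumptions, and the real implementation takes more care to drain all output before the pipes go away.

```go
package main

import (
	"bufio"
	"log"
	"os/exec"
)

func main() {
	// Hypothetical stand-alone illustration, not the PR's code.
	cmd := exec.Command("journalctl", "--follow", "--output", "json")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatalf("cannot create stdout pipe: %v", err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatalf("cannot start journalctl: %v", err)
	}
	log.Printf("journalctl started with PID %d", cmd.Process.Pid)

	dataChan := make(chan []byte) // name borrowed from the discussion above

	// Reader goroutine: when journalctl dies its stdout is closed, the
	// scanner stops on EOF/error, and dataChan is closed, so the loop at
	// the bottom terminates without any extra watching of the channel.
	go func() {
		defer close(dataChan)
		scanner := bufio.NewScanner(stdout)
		for scanner.Scan() {
			line := append([]byte(nil), scanner.Bytes()...)
			dataChan <- line
		}
	}()

	// This goroutine exists only to log the error returned by cmd.Wait
	// (a non-zero exit code or a signal).
	go func() {
		if err := cmd.Wait(); err != nil {
			log.Printf("journalctl exited unexpectedly: %v", err)
		}
	}()

	for line := range dataChan {
		log.Printf("event: %s", line)
	}
}
```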
```go
func (j *journalctl) Next(cancel input.Canceler) ([]byte, error) {
	select {
	case <-cancel.Done():
```
If `cancel.Done()` fires, do we have to kill the process?
This `input.Canceler` is in the context of the `Next` call, so I don't think we should kill the journalctl process. If the journald input is stopped, it will call Close on the Reader, and the reader will kill the journalctl process.
I added some comments explaining when Kill needs to be called.
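To make that division of responsibility concrete, here is a rough sketch under assumed names, with a locally redefined `Canceler` interface standing in for `input.Canceler`: `Next` only honours the per-call canceler, while killing the journalctl process is left to a separate `Kill` that the reader's Close would invoke.

```go
package reader

import (
	"errors"
	"os/exec"
)

// Canceler mirrors the shape of the input.Canceler mentioned above
// (essentially a subset of context.Context); it is redefined here only
// to keep the sketch self-contained.
type Canceler interface {
	Done() <-chan struct{}
	Err() error
}

// journalctl is an assumed, simplified version of the type discussed above.
type journalctl struct {
	cmd      *exec.Cmd
	dataChan chan []byte
}

// Next returns the next journal entry or gives up when the caller cancels.
// It deliberately does not kill the process: the same journalctl instance
// keeps running and serves the following calls to Next.
func (j *journalctl) Next(cancel Canceler) ([]byte, error) {
	select {
	case <-cancel.Done():
		return nil, cancel.Err()
	case data, open := <-j.dataChan:
		if !open {
			return nil, errors.New("no more data, journalctl might have exited")
		}
		return data, nil
	}
}

// Kill terminates the journalctl process. In the real reader this is only
// reached through the reader's Close, i.e. when the journald input itself
// is being stopped.
func (j *journalctl) Kill() error {
	return j.cmd.Process.Kill()
}
```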
Fully isolates the journalctl process into a new type called journalctl, so the reader does not know about it and can simply call a Next method on the new type.
If journalctl exits for any reason, the reader now restarts it.
This commit adds a test to the journald reader that ensures it can restart the journalctl process in case it exits. It is tested by mocking the type that abstracts the calls to journalctl: if there is an error reading the next message, we ensure a new journalctl is created.
This commit refactors TestEventWithNonStringData to use the new types and re-enables the test.
This commit adds an integration test for the journald input. The test ensures that if the journalctl process exits unexpectedly it is restarted.
Fix publishing an empty message when journalctl crashes and needs to be restarted. Now when the reader restarts journalctl it returns ErrRestarting and the input can correctly handle this situation. The tests are updated and the code formatted (mage fmt).
Test the bytes written to systemd-cat on every call, properly handle errors when closing its stdin, and improve error reporting.
This commit removes two wait groups that were not being used correctly and improves error handling. Now if journalctl writes to stderr, the lines are read and directly logged. Lint warnings are also fixed.
Make sure the logger name is consistent everywhere in the reader and its subcomponents.
When restarting journalctl, an exponential backoff is used if the last restart was less than 5s ago. In the unlikely case that journalctl keeps crashing right after being started, the exponential backoff makes Filebeat restart journalctl at most once every two seconds.
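As an illustration of the backoff described in the last commit above, a minimal sketch follows. Only the behaviour comes from the text (back off when the previous restart happened less than 5s ago, capped so journalctl is restarted at most once every two seconds); the initial delay, names, and structure are assumptions.

```go
package reader

import "time"

// restartBackoff decides how long to wait before restarting journalctl.
// Sketch only: the 5s window and the 2s cap come from the commit message,
// everything else is an assumption.
type restartBackoff struct {
	lastRestart time.Time
	wait        time.Duration
}

// next returns the delay to apply before the upcoming restart.
func (b *restartBackoff) next() time.Duration {
	now := time.Now()
	if now.Sub(b.lastRestart) >= 5*time.Second {
		// journalctl ran for a while: restart immediately and reset.
		b.wait = 0
	} else {
		// Crash shortly after the previous restart: back off exponentially,
		// capped so we restart at most once every two seconds.
		if b.wait == 0 {
			b.wait = 250 * time.Millisecond
		} else {
			b.wait *= 2
		}
		if b.wait > 2*time.Second {
			b.wait = 2 * time.Second
		}
	}
	b.lastRestart = now
	return b.wait
}
```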
If journalctl exits unexpectedly the journald input will restart it and set the cursor to the last known position. Any error/non-zero return code is logged at level error. There is an exponential backoff that caps at 1 restart every 2s. (cherry picked from commit a9fb9fa) Co-authored-by: Tiago Queiroz <tiago.queiroz@elastic.co>
#### Description

According to the community, there are bugs in systemd that could corrupt the journal files or crash the log receiver: systemd/systemd#24320 systemd/systemd#24150

We've seen some issues reported to the Elastic/beats project: elastic/beats#39352 elastic/beats#32782 elastic/beats#34077

Unfortunately, the otelcol is not immune from these issues. When the journalctl process exits for any reason, the log consumption from journald just stops. We've experienced this on some machines that have a high log volume. Currently we monitor the journalctl processes started by otelcol and restart the otelcol when any of them is missing. IMO, the journald receiver itself should monitor the journalctl process it starts and do its best to keep it alive.

In this PR, we try to restart the journalctl process when it exits unexpectedly. As long as the journalctl cmd can be started (via `Cmd.Start()`) successfully, the journald_input will always try to restart the journalctl process if it exits.

The error reporting behaviour changes a bit in this PR. Before the PR, `operator.Start` waits up to 1 sec to capture any immediate error returned from journalctl. After the PR, the error won't be reported back even if journalctl exits immediately after start; instead, the error will be logged and the process will be restarted. The fix is largely inspired by elastic/beats#40558.

#### Testing

Add a simple bash script that prints a line every second, and load it into systemd.

`log_every_second.sh`:
```bash
#!/bin/bash
while true; do
  echo "Log message: $(date)"
  sleep 1
done
```

`log.service`:
```
[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

Start the otelcol with the following config:
```yaml
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]
receivers:
  journald:
exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1
```

Kill the journalctl process and observe the otelcol's behaviour. The journalctl process will be restarted after the backoff period (hardcoded to 2 sec):
```
2024-10-06T14:32:33.755Z info  LogsExporter  {"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 1}
2024-10-06T14:32:34.709Z error journald/input.go:98  journalctl command exited  {"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: terminated"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
  github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.108.0/operator/input/journald/input.go:98
2024-10-06T14:32:36.712Z debug journald/input.go:94  Starting the journalctl command  {"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input"}
2024-10-06T14:32:36.756Z info  LogsExporter  {"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 10}
```

Signed-off-by: Mengnan Gong <namco1992@gmail.com>
Proposed commit message
If journalctl exits unexpectedly the journald input will restart it and set the cursor to the last known position. Any error/non-zero return code is logged at level error.
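To illustrate "set the cursor to the last known position", a restart could rebuild the journalctl invocation around the last cursor that was read. In the sketch below, `--after-cursor` is a real journalctl flag; the function name and the rest of the argument handling are assumptions rather than the input's actual code.

```go
package reader

import "os/exec"

// newJournalctlCmd sketches how a (re)start could resume from the last
// known position. Only --after-cursor is taken from journalctl's real
// flag set; everything else is an assumption for this sketch.
func newJournalctlCmd(cursor string) *exec.Cmd {
	args := []string{"--utc", "--output=json", "--follow"}
	if cursor == "" {
		// First start: there is no position to resume from.
		return exec.Command("journalctl", args...)
	}
	// Restart: continue right after the last entry that was already
	// published, so the restart neither duplicates nor drops events.
	args = append(args, "--after-cursor", cursor)
	return exec.Command("journalctl", args...)
}
```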
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding changes to the default configuration files
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc

Disruptive User Impact
Author's Checklist
How to test this PR locally
filebeat.yml
potato
Related issues
Tests
I ran a test overnight where I used a mock to simulate constant failures and restarts of journalctl and monitored the host and the Filebeat process to ensure there weren't any goroutine/resource leaks. The screenshot below shows the CPU and memory usage from Filebeat as well as counters for the log entries stating journalctl crashed and was restarted.
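For context, this is roughly what a failure-injecting mock could look like; all names here are hypothetical, and the real test uses the project's own abstraction over journalctl.

```go
package reader

import "errors"

// mockJournalctl is a hypothetical stand-in for the type that wraps the
// journalctl process. Every call to Next fails, forcing the reader to
// restart "journalctl" over and over, which is the situation the
// overnight leak test simulated.
type mockJournalctl struct {
	nextCalls int
}

// Next always reports a crash so the reader's restart path is exercised.
func (m *mockJournalctl) Next() ([]byte, error) {
	m.nextCalls++
	return nil, errors.New("simulated journalctl crash")
}

// Kill pretends to terminate the process; nothing to clean up in the mock.
func (m *mockJournalctl) Kill() error { return nil }
```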