
[Journald] Restart journalctl if it exits unexpectedly #40558

Merged
belimawr merged 28 commits into elastic:main on Sep 11, 2024

Conversation

Contributor

@belimawr belimawr commented Aug 19, 2024

Proposed commit message

If journalctl exits unexpectedly, the journald input will restart it and set the cursor to the last known position. Any error/non-zero return code is logged at level error.
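
For illustration, a minimal sketch of that restart loop (assumed shape only; `restartLoop` and `readEvents` are hypothetical names, not the PR's actual code, which lives in filebeat/input/journald/pkg/journalctl; backoff is elided here):

```go
package main

import (
	"bufio"
	"context"
	"encoding/json"
	"io"
	"log"
	"os/exec"
)

// restartLoop keeps journalctl running: if the process exits
// unexpectedly, it is restarted from the last known cursor, and the
// error/non-zero exit code is logged.
func restartLoop(ctx context.Context) {
	var cursor string
	for ctx.Err() == nil {
		args := []string{"--utc", "--output=json", "--follow"}
		if cursor != "" {
			// Resume from the last position we successfully read.
			args = append(args, "--after-cursor", cursor)
		}
		cmd := exec.CommandContext(ctx, "journalctl", args...)
		stdout, err := cmd.StdoutPipe()
		if err != nil {
			log.Printf("error: cannot create stdout pipe: %v", err)
			return
		}
		if err := cmd.Start(); err != nil {
			log.Printf("error: cannot start journalctl: %v", err)
			return
		}
		log.Printf("journalctl started with PID %d", cmd.Process.Pid)
		cursor = readEvents(stdout, cursor) // blocks until EOF
		if err := cmd.Wait(); err != nil {
			log.Printf("error: journalctl exited unexpectedly: %v", err)
		}
	}
}

// readEvents consumes JSON journal entries until EOF and returns the
// cursor (__CURSOR field) of the last entry seen.
func readEvents(r io.Reader, cursor string) string {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		var entry map[string]any
		if err := json.Unmarshal(sc.Bytes(), &entry); err != nil {
			continue
		}
		if c, ok := entry["__CURSOR"].(string); ok {
			cursor = c
		}
	}
	return cursor
}

func main() { restartLoop(context.Background()) }
```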

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Author's Checklist

How to test this PR locally

  1. Build Filebeat from this PR and start it with the following configuration file:
filebeat.yml

filebeat.inputs:
  - type: journald
    id: my-unique-input-id
    enabled: true
    syslog_identifiers:
      - "potato"

output:
  console:
    pretty: true

  2. Publish some events with the syslog identifier potato:
echo "$(date '+%Y-%m-%d %T') - foo" | systemd-cat -t potato
  3. Wait to see the output in the console.
  4. Kill journalctl:
killall journalctl
  5. Wait for journalctl to restart and look for a log entry like this:
{
  "log.level": "info",
  "@timestamp": "2024-08-20T12:41:29.682-0400",
  "log.logger": "input.journald.reader.journalctl-runner",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/filebeat/input/journald/pkg/journalctl.Factory",
    "file.name": "journalctl/journalctl.go",
    "file.line": 120
  },
  "message": "journalctl started with PID 3119456",
  "service.name": "filebeat",
  "id": "filestream-input-id",
  "input_source": "LOCAL_SYSTEM_JOURNAL",
  "path": "LOCAL_SYSTEM_JOURNAL",
  "ecs.version": "1.6.0"
}
  6. Publish a few more messages and ensure they appear in the console:
echo "$(date '+%Y-%m-%d %T') - foo" | systemd-cat -t potato

Related issues

Tests

I ran a test overnight where I used a mock to simulate constant failures and restarts of journalctl and monitored the host and the Filebeat process to ensure there weren't any goroutine/resource leaks. The screenshot below shows the CPU and memory usage from Filebeat as well as counters for the log entries stating journalctl crashed and was restarted.

[Screenshot: 2024-09-10_09-14]
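
The mock used in that test was roughly of this shape (a hypothetical reconstruction for illustration, not the actual test code; all names below are assumptions):

```go
package mock

import (
	"context"
	"errors"
	"sync/atomic"
)

// failingJctl stands in for the type abstracting journalctl calls.
// Its Next always fails, forcing the reader to restart "journalctl".
type failingJctl struct{}

func (failingJctl) Next(ctx context.Context) ([]byte, error) {
	return nil, errors.New("simulated journalctl crash")
}

func (failingJctl) Kill() error { return nil }

// restarts counts how many times the factory created a new instance,
// i.e. how many restarts the reader performed during the test.
var restarts atomic.Int64

func factory() failingJctl {
	restarts.Add(1)
	return failingJctl{}
}
```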

Use cases

Screenshots

Logs

@belimawr belimawr added backport-skip Skip notification from the automated backport with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Aug 19, 2024
@belimawr belimawr self-assigned this Aug 19, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Aug 19, 2024
@belimawr belimawr changed the title Restart journalctl [Journald] Restart journalctl if it exits unexpectedly Aug 20, 2024
Contributor

mergify bot commented Aug 20, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b restart-journalctl upstream/restart-journalctl
git merge upstream/main
git push upstream restart-journalctl

@belimawr belimawr added skip-ci Skip the build in the CI but linting and removed skip-ci Skip the build in the CI but linting labels Aug 20, 2024
Contributor

mergify bot commented Aug 26, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b restart-journalctl upstream/restart-journalctl
git merge upstream/main
git push upstream restart-journalctl

@belimawr belimawr marked this pull request as ready for review August 26, 2024 20:54
@belimawr belimawr requested a review from a team as a code owner August 26, 2024 20:54
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

filebeat/input/journald/pkg/journalctl/journalctl.go (outdated)
logger.Infof("journalctl started with PID %d", cmd.Process.Pid)

go func() {
if err := cmd.Wait(); err != nil {
Contributor

I'm wondering if we should also watch for the dataChan and errChan to close. For example, if we failed to read stdout for some reason, that goroutine loop would exit.

Contributor Author

This goroutine is here only to log any error returned by cmd.Wait; if the process exits unexpectedly, its stderr and stdout will be closed and the reader goroutines will get an EOF or error. At the moment I'm doing my best to read and ship/log all data without overcomplicating the code and risking a deadlock.
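
To illustrate the point, a sketch of such a reader goroutine (assumed names, assuming bufio/io imports; not the PR's actual reader code): once the process dies, the read hits EOF and the goroutine exits on its own.

```go
// readStdout drains journalctl's stdout into dataChan. When the
// process exits, stdout is closed, Scan returns false, and this
// goroutine terminates by itself; no extra liveness watching needed.
func readStdout(stdout io.Reader, dataChan chan<- []byte, errChan chan<- error) {
	defer close(dataChan)
	sc := bufio.NewScanner(stdout)
	for sc.Scan() {
		// Copy the bytes: the Scanner reuses its internal buffer.
		dataChan <- append([]byte(nil), sc.Bytes()...)
	}
	if err := sc.Err(); err != nil { // Err returns nil on a plain EOF
		errChan <- err
	}
}
```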

filebeat/input/journald/pkg/journalctl/journalctl.go (outdated)
filebeat/input/journald/pkg/journalctl/journalctl.go (outdated)

func (j *journalctl) Next(cancel input.Canceler) ([]byte, error) {
select {
case <-cancel.Done():
Contributor

if cancel.Done(), do we have to kill the process?

Contributor Author

This input.Canceler is in the context of the Next call, so I don't think we should kill the journalctl process. If the journald input is stopped, it will call Close on the Reader:

https://github.com/belimawr/beats/blob/e7d226030d8a90477949a4cd3a013e09c28db519/filebeat/input/journald/input.go#L236-L238

and the reader will kill the journalctl process

https://github.com/belimawr/beats/blob/e7d226030d8a90477949a4cd3a013e09c28db519/filebeat/input/journald/pkg/journalctl/reader.go#L192-L200.

I added some comments explaining when kill needs to be called.
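
For context, a minimal sketch of that ownership split (the linked reader.go is authoritative; the names below are assumptions):

```go
// jctl abstracts the journalctl process (assumed shape).
type jctl interface{ Kill() error }

// Reader owns the journalctl process; Next's canceler only aborts a
// blocking read, it does not stop the child process.
type Reader struct{ jctl jctl }

// Close is called when the input stops; killing the process here is
// required because nothing else stops the journalctl child.
func (r *Reader) Close() error {
	if r.jctl == nil {
		return nil
	}
	return r.jctl.Kill()
}
```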

Contributor

mergify bot commented Sep 10, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b restart-journalctl upstream/restart-journalctl
git merge upstream/main
git push upstream restart-journalctl

Fully isolates the journalctl process into a new type called
journalctl, so the reader does not know about it and can simply call a
Next method on the new type.
If journalctl exits for any reason, the reader now restarts it.
This commit adds a test to the journald reader that ensures it can
restart the journalctl process in case it exits. It is tested by
mocking the type abstracting the calls to journalctl; if there is an
error reading the next message, we ensure a new journalctl is created.
This commit refactors TestEventWithNonStringData to use the new types
and re-enables the test.
This commit adds an integration test for the journald input. The test
ensures that if the journalctl process exits unexpectedly it is
restarted.
Fix publishing an empty message when journalctl crashes and needs to
be restarted. Now when the reader restarts journalctl it returns
ErrRestarting and the input can correctly handle this situation.

The tests are updated and the code formatted (mage fmt).
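
The sentinel-error pattern described there looks roughly like this (a sketch; `next` and `publish` are stand-ins, not the input's actual code):

```go
import "errors"

// ErrRestarting signals that journalctl is being restarted: callers
// should retry Next rather than publish an empty event or give up.
var ErrRestarting = errors.New("journalctl is restarting")

func readAndPublish(next func() ([]byte, error), publish func([]byte)) error {
	for {
		data, err := next()
		if errors.Is(err, ErrRestarting) {
			continue // restart in progress: publish nothing this round
		}
		if err != nil {
			return err
		}
		publish(data)
	}
}
```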
Test the bytes written to systemd-cat on every call, properly handle
errors when closing its stdin, and improve error reporting.
This commit removes two wait groups that were not being used correctly
and improves error handling.

Now if journalctl writes to stderr, the lines are read and directly
logged.

Lint warnings are also fixed.
Make sure the logger name is consistent everywhere in the reader and
its subcomponents.
When restarting journalctl, an exponential backoff is used if the
last restart was less than 5s ago. In the unlikely case that
journalctl keeps crashing right after being started, the exponential
backoff will make Filebeat restart journalctl at most once every two
seconds.
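
That policy can be sketched as follows (the constants and `runJournalctlOnce` are assumptions for illustration, assuming the time package):

```go
// superviseJournalctl restarts journalctl whenever it exits. If the
// last run lasted less than 5s, it sleeps with exponential backoff,
// capped so restarts happen at most once every two seconds.
func superviseJournalctl(runJournalctlOnce func()) {
	backoff := 100 * time.Millisecond // assumed initial delay
	for {
		started := time.Now()
		runJournalctlOnce() // blocks until the process exits
		if time.Since(started) >= 5*time.Second {
			backoff = 100 * time.Millisecond // healthy run: reset
			continue
		}
		time.Sleep(backoff)
		backoff *= 2
		if backoff > 2*time.Second {
			backoff = 2 * time.Second // cap: at most one restart per 2s
		}
	}
}
```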
@belimawr belimawr requested a review from a team as a code owner September 10, 2024 18:49
@v1v v1v added backport-v8.x and removed backport-skip Skip notification from the automated backport with mergify labels Sep 10, 2024
Contributor

mergify bot commented Sep 11, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Sep 11, 2024
@v1v v1v removed the backport-v8.x label Sep 11, 2024
@belimawr belimawr merged commit a9fb9fa into elastic:main Sep 11, 2024
123 of 125 checks passed
mergify bot pushed a commit that referenced this pull request Sep 11, 2024
If journalctl exits unexpectedly, the journald input will restart it and set the cursor to the last known position. Any error/non-zero return code is logged at level error. There is an exponential backoff that caps at 1 restart every 2s.

(cherry picked from commit a9fb9fa)
belimawr added a commit that referenced this pull request Sep 13, 2024
If journalctl exits unexpectedly, the journald input will restart it and set the cursor to the last known position. Any error/non-zero return code is logged at level error. There is an exponential backoff that caps at 1 restart every 2s.

(cherry picked from commit a9fb9fa)

Co-authored-by: Tiago Queiroz <tiago.queiroz@elastic.co>
djaglowski pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Oct 28, 2024
#### Description
According to the community, there are bugs in systemd that could corrupt
the journal files or crash the log receiver:
systemd/systemd#24320
systemd/systemd#24150

We've seen some issues reported to Elastic/beats project:
elastic/beats#39352
elastic/beats#32782
elastic/beats#34077

Unfortunately, the otelcol is not immune to these issues. When the
journalctl process exits for any reason, log consumption from
journald just stops. We've experienced this on some machines that have
a high log volume. Currently we monitor the journalctl processes
started by otelcol and restart otelcol when one of them is missing.
IMO, the journald receiver itself should monitor the journalctl
process it starts and do its best to keep it alive.

In this PR, we try to restart the journalctl process when it exits
unexpectedly. As long as the journalctl cmd can be started (via
`Cmd.Start()`) successfully, the journald_input will always try to
restart the journalctl process if it exits.

The error reporting behaviour changes a bit in this PR. Before the PR,
`operator.Start` waits up to 1 sec to capture any immediate error
returned from journalctl. After the PR, the error won't be reported
back even if journalctl exits immediately after start; instead, the
error will be logged and the process will be restarted.
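
In other words, only a failed `Cmd.Start()` surfaces as a startup error; later exits are handled by the run loop. A rough sketch of that split (assumed names and types, not the stanza receiver's actual code):

```go
import "os/exec"

// Input is a stand-in for the receiver's operator type.
type Input struct{ args []string }

// Start reports an error only when journalctl cannot be started at
// all; any exit after a successful start is logged and handled by the
// run loop, which restarts the process after the backoff.
func (i *Input) Start() error {
	cmd := exec.Command("journalctl", i.args...)
	if err := cmd.Start(); err != nil {
		return err // the only error reported back to the caller
	}
	go i.run(cmd)
	return nil
}

// run waits for the process, logs its exit, and restarts it (elided).
func (i *Input) run(cmd *exec.Cmd) { _ = cmd.Wait() }
```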

The fix is largely inspired by
elastic/beats#40558.

#### Testing
Add a simple bash script that prints a line every second, and load it
into systemd.

`log_every_second.sh`:
```bash
#!/bin/bash
while true; do
    echo "Log message: $(date)"
    sleep 1
done
```

`log.service`:
```
[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

Start the otelcol with the following config:
```yaml
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]

receivers:
  journald:

exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1
```

Kill the journalctl process and observe otelcol's behaviour. The
journalctl process will be restarted after the backoff period
(hardcoded to 2 sec):
```bash
2024-10-06T14:32:33.755Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 1}
2024-10-06T14:32:34.709Z	error	journald/input.go:98	journalctl command exited	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: terminated"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.108.0/operator/input/journald/input.go:98
2024-10-06T14:32:36.712Z	debug	journald/input.go:94	Starting the journalctl command	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input"}
2024-10-06T14:32:36.756Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 10}
```


---------

Signed-off-by: Mengnan Gong <namco1992@gmail.com>
jpbarto pushed a commit to jpbarto/opentelemetry-collector-contrib that referenced this pull request Oct 29, 2024