
[Journald] input crashes with "failed to read message field: cannot allocate memory" #39352

Closed
Tracked by #37086
belimawr opened this issue May 1, 2024 · 9 comments · Fixed by #40558
Assignees
Labels
bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@belimawr
Contributor

belimawr commented May 1, 2024

Filebeat: 8.13.2
Host OS: Amazon Linux 2
Systemd/Journald version: systemd 252 (252.16-1.amzn2023.0.2)

journalctl --version
systemd 252 (252.16-1.amzn2023.0.2)
+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 -BZIP2 -LZ4 +XZ -ZLIB -ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified

## How to reproduce

  1. Flood journald with logs so that it rotates them every minute or so, mostly following [Filebeat] Journald causes Filebeat to crash #34077 (comment); see the sketch after this list
  2. Start Filebeat with the config from the above link
  3. Wait until Journald reaches its maximum number of files and starts deleting old entries
  4. Filebeat might crash due to [Filebeat] Journald causes Filebeat to crash #34077; that's OK, ignore it
  5. Let the logs flow for a while (I waited for hours)
  6. Start Filebeat again
  7. Journald input will fail with:
    {"log.level":"error","@timestamp":"2024-05-01T19:29:01.010Z","log.logger":"input.journald","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/v2/compat.(*runner).Start.func1","file.name":"compat/compat.go","file.line":132},"message":"Input 'journald' failed with: input.go:130: input journald-input failed (id=journald-input)\n\tfailed to read message field: cannot allocate memory","service.name":"filebeat","id":"journald-input","ecs.version":"1.6.0"}
    

Sometimes Filebeat might just crash again. I also saw it failing once or twice with the same message as in #32782.

Both seem to be related to Filebeat being too far behind when reading the journal, probably further behind than what journald has stored on disk.

In both cases the error comes from the journald library we use, github.com/coreos/go-systemd/v22.

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 1, 2024
@belimawr belimawr added bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels May 1, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@belimawr
Contributor Author

belimawr commented Aug 9, 2024

Even after merging #40061, I can still reproduce this "cannot allocate memory" error.

Here is how the new error log looks:

{
  "log.level": "error",
  "@timestamp": "2024-08-08T21:02:16.242Z",
  "log.logger": "input.journald",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/filebeat/input/journald/pkg/journalctl.(*Reader).Close",
    "file.name": "journalctl/reader.go",
    "file.line": 256
  },
  "message": "Journalctl wrote to stderr: Failed to get journal fields: Cannot allocate memory\n",
  "service.name": "filebeat",
  "id": "PR-testing",
  "input_source": "LOCAL_SYSTEM_JOURNAL",
  "path": "LOCAL_SYSTEM_JOURNAL",
  "ecs.version": "1.6.0"
}

This seems to be happening in journalctl itself; it's reproducible in the same way, but it takes longer to happen. I noticed it happening more often when the system is under pressure (a VM with all CPUs at 100% and 2 GB of RAM).

Interestingly enough, this situation is also reproducible when running the OTel journald receiver, which also calls journalctl and reads its JSON output.

I believe the easiest solution for this in Filebeat is to be more resilient to journalctl crashes, restarting journalctl instead of stopping the input.
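
As a rough illustration of that approach (a sketch only, with cursor/offset handling deliberately left out), the supervision boils down to a restart loop around journalctl:

```bash
#!/bin/bash
# Keep journalctl running: whenever it exits (crash, OOM kill, signal), log
# the exit status and start a new instance after a short backoff. A real
# implementation would resume from the persisted cursor instead of re-reading.
while true; do
    journalctl --utc --output=json --follow
    echo "journalctl exited with status $?, restarting in 2s" >&2
    sleep 2
done
```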

@belimawr
Contributor Author

belimawr commented Aug 9, 2024

@pierrehilbert I changed the status to 'need technical definition' because we need to decide how to handle this error. At the moment the best option seems to be making the journald input resilient to journalctl crashes and then validating whether this is still an issue that needs to be addressed directly.

@cmacknz
Member

cmacknz commented Aug 12, 2024

Same as #32782 essentially.

We need to automatically recover from this if we aren't already.

@cmacknz
Member

cmacknz commented Aug 12, 2024

This one seems more like system memory exhaustion though, how much memory is available on the host when it happens? Can other programs allocate memory?

@belimawr
Contributor Author

> This one seems more like system memory exhaustion though, how much memory is available on the host when it happens? Can other programs allocate memory?

IIRC increasing the VM's memory helped, but I didn't see memory usage reach 100% in any of my tests; the CPUs were at 100%. I didn't see the whole system crash, nor did the system become unresponsive.

Once I start working on those issues I'll properly collect system metrics, probably with Metricbeat.
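
In the meantime, a quick ad-hoc check along these lines (just a sketch; python3 is used only as an arbitrary allocator and the 512 MiB size is arbitrary) would show whether other processes on the host can still allocate memory when the error appears:

```bash
# Overall memory usage, then a one-off allocation from a fresh process to
# see whether allocations outside Filebeat/journalctl also fail.
free -h
python3 -c "b = bytearray(512 * 1024 * 1024); print('allocated 512 MiB OK')"
```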

@belimawr
Contributor Author

I'll start working on recovering when journalctl crashes; for this case simply restarting it should suffice and should not create any problems with offset tracking.
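
For reference, a sketch of why a plain restart keeps offset tracking intact, assuming the input persists the last journal cursor (the cursor file path below is purely illustrative):

```bash
# Resume journalctl from the last persisted cursor after a crash, so no
# entries are re-read or skipped (cursor file location is hypothetical).
journalctl --utc --output=json --follow \
    --after-cursor "$(cat /var/lib/filebeat/journald.cursor)"
```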

@belimawr
Contributor Author

This is not fixed by #40558

@belimawr belimawr reopened this Sep 13, 2024
@belimawr
Contributor Author

I managed to reproduce this issue by just calling journalctl and the machine seems to have plenty of memory available:

[root@aws-test-journald ~]# journalctl --utc --output=json --follow --after-cursor "s=922eeded44734fd9b2fe892ceb4ec2df;i=2e21e0;b=5d68f7c3ebf040879a4ab3ccbb2d965b;m=2e2762073e;t=62206b43c435e;x=46e90e97a06137ad">journal4.ndjson
Failed to read journal: Cannot allocate memory
[root@aws-test-journald ~]# free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       1.0Gi       7.9Gi       120Mi       6.7Gi        14Gi
Swap:             0B          0B          0B
[root@aws-test-journald ~]# 

CPU usage is not low, with some cores hitting 100%, but not all of them at the same time.

I believe the best we can do here is to make sure the input does not get stuck, which is done by #40558.

I'll close this issue as solved.

djaglowski pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue Oct 28, 2024
#### Description
According to the community, there are bugs in systemd that could corrupt
the journal files or crash the log receiver:
systemd/systemd#24320
systemd/systemd#24150

We've seen some issues reported to Elastic/beats project:
elastic/beats#39352
elastic/beats#32782
elastic/beats#34077

Unfortunately, the otelcol is not immune to these issues. When the
journalctl process exits for any reason, the log consumption from
journald just stops. We've experienced this on some machines that have
high log volume. Currently we monitor the journalctl processes started
by otelcol and restart the otelcol when some of them are missing. IMO,
the journald receiver itself should monitor the journalctl process it
starts and do its best to keep it alive.

In this PR, we try to restart the journalctl process when it exits
unexpectedly. As long as the journalctl cmd can be started (via
`Cmd.Start()`) successfully, the journald_input will always try to
restart the journalctl process if it exits.

The error reporting behaviour changes a bit in this PR. Before the PR,
`operator.Start` waits up to 1 sec to capture any immediate error
returned from journalctl. After the PR, the error won't be reported back
even if journalctl exits immediately after start; instead, the error
will be logged and the process will be restarted.

The fix is largely inspired by
elastic/beats#40558.

#### Testing
Add a simple bash script that prints a line every second, and load it
into systemd.

`log_every_second.sh`:
```bash
#!/bin/bash
while true; do
    echo "Log message: $(date)"
    sleep 1
done
```

`log.service`:
```
[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

Start the otelcol with the following config:
```yaml
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]

receivers:
  journald:

exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1
```

Kill the journalctl process and observe the otelcol's behaviour. The
journalctl process will be restarted after the backoff period (hardcoded
to 2 sec):
```bash
2024-10-06T14:32:33.755Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 1}
2024-10-06T14:32:34.709Z	error	journald/input.go:98	journalctl command exited	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: terminated"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.108.0/operator/input/journald/input.go:98
2024-10-06T14:32:36.712Z	debug	journald/input.go:94	Starting the journalctl command	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input"}
2024-10-06T14:32:36.756Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 10}
```


---------

Signed-off-by: Mengnan Gong <namco1992@gmail.com>
jpbarto pushed a commit to jpbarto/opentelemetry-collector-contrib that referenced this issue Oct 29, 2024