
Filebeat stops shipping journald logs when encountering "failed to read message field: bad message" error #32782

Closed
Tracked by #37086
snjoetw opened this issue Aug 23, 2022 · 16 comments · Fixed by #40558
Assignees
Labels
bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@snjoetw

snjoetw commented Aug 23, 2022

Filebeat seems to stop sending journald logs when it encounters a "failed to read message field: bad message" error.

journald input:

- type: journald
  id: appdaemon_logs
  paths:
    - /var/log/journal
  include_matches.match:
    - syslog.priority=6

Logs:

{
	"log.level": "error",
	"@timestamp": "2022-08-23T00:31:06.768-0700",
	"log.logger": "input.journald",
	"log.origin": {
		"file.name": "compat/compat.go",
		"file.line": 124
	},
	"message": "Input 'journald' failed with: input.go:130: input appdaemon_logs failed (id=appdaemon_logs)\n\tfailed to read message field: bad message",
	"service.name": "filebeat",
	"id": "appdaemon_logs",
	"ecs.version": "1.6.0"
}

Filebeat 8.3.3

A similar bug was also reported in Loki; not sure if the fix is similar or not:
Bug Report: grafana/loki#2812
PR: grafana/loki#2928

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Aug 23, 2022
@jsoriano
Member

Could be related to the issues discussed in #23627

@jsoriano jsoriano added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Aug 25, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Aug 25, 2022
@trauta

trauta commented Jan 18, 2023

I'm having the same problem with filebeat version 7.17.8 (amd64) on Ubuntu 20.04.5 LTS.

Here is the journald input config:

- type: journald
  id: syslog
  seek: cursor
  fields_under_root: false
  tags:
    - journald
    - syslog

From the filebeat log:

2023-01-11T12:19:13.844+0100	ERROR	[input.journald]	compat/compat.go:124	Input 'journald' failed with: input.go:130: input syslog failed (id=syslog)
	failed to iterate journal: bad message	{"id": "syslog"}

After this error message the journald input completely stops; no new journald events are transmitted.
When filebeat is restarted via systemd, the missing journal events get transmitted without any problems.

This is a very irritating behavior.

Is it possible to implement a fix so that, in case of such an error, the bad message is skipped and the input continues to parse the other messages?

@trauta

trauta commented Jun 30, 2023

Hi, any updates on this?

@cmacknz
Member

cmacknz commented Jul 4, 2023

Sorry, we haven't been able to prioritize this issue yet.

@nicklausbrown

Any chance this will be prioritized soon, or is there a timeline for promoting journald out of tech preview? Thanks!

@georgivalentinov

georgivalentinov commented Aug 9, 2023

It seems the underlying issue is with systemd - they appear to have introduced a regression and are having trouble finding and fixing it.
That being said, this is a major issue, as people are moving away from log files and the use of journald is prevalent these days.

Any chance such records (or entire journal files) can be skipped by Filebeat so it can just continue with the rest? As it is today, it blocks and silently stops processing any log entries upon stumbling on a problematic spot.

@groegeorg

groegeorg commented Apr 18, 2024

I'm also desperately hoping for a fix on this issue. I second georgivalentinov's statement: "Any chance such records (or entire journald files) can be skipped by Filebeat and just continue with the rest?"

@cmacknz
Member

cmacknz commented Apr 18, 2024

We are treating taking journald out of tech preview as a priority (see #37086), and this needs to be fixed as part of that work.

@belimawr
Contributor

belimawr commented May 1, 2024

I've been investigating this crash. It is reproducible like #34077; however, it also happens with the following journald version from Ubuntu 24.04 LTS:

systemd 255 (255.4-1ubuntu8)
+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified

The error is coming from:

entry, err := r.journal.GetEntry()
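
For illustration, here is a minimal sketch of the "skip and continue" behaviour requested earlier in this thread, written against the github.com/coreos/go-systemd/v22/sdjournal bindings (the API GetEntry comes from; it requires cgo and libsystemd). This is not the actual Filebeat code, and it is not clear whether libsystemd can actually advance past a corrupted entry once it reports "bad message":

```go
// Illustrative sketch only (not the actual Filebeat input): read the journal
// with the go-systemd sdjournal bindings and skip an entry that cannot be
// read, instead of failing the whole input.
package main

import (
	"log"

	"github.com/coreos/go-systemd/v22/sdjournal"
)

func main() {
	j, err := sdjournal.NewJournal() // open the local system journal
	if err != nil {
		log.Fatalf("open journal: %v", err)
	}
	defer j.Close()

	for {
		n, err := j.Next()
		if err != nil {
			log.Fatalf("advance journal: %v", err) // cannot even move the read pointer
		}
		if n == 0 {
			break // reached the end of the journal
		}

		entry, err := j.GetEntry()
		if err != nil {
			// A corrupted entry ("bad message"): log it and move on to the
			// next one instead of aborting.
			log.Printf("skipping unreadable journal entry: %v", err)
			continue
		}
		log.Printf("message: %s", entry.Fields["MESSAGE"])
	}
}
```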

@belimawr
Contributor

belimawr commented Aug 9, 2024

Even after merging #40061, I can still reproduce this "cannot allocate memory" error.

Here is how the new error log looks:

{
  "log.level": "error",
  "@timestamp": "2024-08-08T20:33:26.947Z",
  "log.logger": "input.journald",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/filebeat/input/journald/pkg/journalctl.(*Reader).Close",
    "file.name": "journalctl/reader.go",
    "file.line": 256
  },
  "message": "Journalctl wrote to stderr: Failed to iterate through journal: Bad message\n",
  "service.name": "filebeat",
  "id": "PR-testig",
  "input_source": "LOCAL_SYSTEM_JOURNAL",
  "path": "LOCAL_SYSTEM_JOURNAL",
  "ecs.version": "1.6.0"
}

It seems that even journalctl is struggling to read some messages. What makes it hard to debug is that so far I've only managed to reproduce it when the system and the journal are under high load, and sometimes I get a memory error first (see #39352).

The solution here seems to be the same: make the journald input restart journalctl. However, we need to understand what happens to the message/cursor. Is the message lost? Will we get stuck on this message? Can we skip it in code and log an error/warning for the user?
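
To make the restart idea concrete, below is a rough, hypothetical sketch of supervising journalctl as a subprocess and resuming from the last successfully read cursor after it exits. The journalctl flags (--follow, --output=json, --after-cursor) and the __CURSOR field are real journalctl behaviour; everything else is illustrative and is not the implementation from #40558:

```go
// Hypothetical sketch: run journalctl as a subprocess, remember the __CURSOR
// of the last successfully parsed line, and restart journalctl from that
// cursor when it exits.
package main

import (
	"bufio"
	"encoding/json"
	"log"
	"os/exec"
	"time"
)

// runOnce starts one journalctl process and returns the last cursor seen.
func runOnce(lastCursor string) (string, error) {
	args := []string{"--utc", "--follow", "--output=json"}
	if lastCursor != "" {
		args = append(args, "--after-cursor", lastCursor)
	}

	cmd := exec.Command("journalctl", args...)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return lastCursor, err
	}
	if err := cmd.Start(); err != nil {
		return lastCursor, err
	}

	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		var entry map[string]any
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			log.Printf("skipping malformed journalctl output line: %v", err)
			continue
		}
		if c, ok := entry["__CURSOR"].(string); ok {
			lastCursor = c // remember our position for a possible restart
		}
		// ... publish the entry to the output pipeline here ...
	}

	// journalctl exited ("Bad message", rotation, crash, ...); report it so
	// the caller can restart from lastCursor.
	return lastCursor, cmd.Wait()
}

func main() {
	cursor := ""
	for {
		var err error
		cursor, err = runOnce(cursor)
		if err != nil {
			log.Printf("journalctl exited: %v; restarting after backoff", err)
		}
		time.Sleep(2 * time.Second) // simple fixed backoff
	}
}
```

Whether the entry at the failing position is lost or re-read after the restart is exactly the open question above.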

@belimawr
Contributor

belimawr commented Aug 9, 2024

@pierrehilbert I changed the status to 'need technical definition' because we need to decide how to handle this error. At the moment the best option seems to be making the journald input resilient to journalctl crashes and then validating whether this is still an issue that needs to be addressed directly.

@cmacknz
Member

cmacknz commented Aug 12, 2024

At the moment the best option seems to be making the journald input resilient to journalctl crashes

Even if this bug didn't exist we should be doing this and automatically recovering from as many problem situations as we can.

@belimawr
Contributor

Even if this bug didn't exist we should be doing this and automatically recovering from as many problem situations as we can.

I agree, but we need to be careful about how we implement this to avoid getting stuck on a "bad message". So far I have only managed to reproduce this when putting the system under stress to test/investigate the journal-rotation-related crashes.

We already have an issue to track this improvement: #39355.
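
One hypothetical way to avoid getting stuck on a single "bad message" would be a small guard that counts consecutive failures at the same cursor and gives up (or deliberately skips ahead) after a few attempts. A sketch, purely illustrative and not something the input currently does:

```go
// Hypothetical guard against restarting forever at the same journal position.
// If the reader keeps dying without the cursor advancing, stop retrying (or
// skip the entry) after maxRetries attempts.
package main

import "fmt"

type restartGuard struct {
	lastCursor string
	failures   int
	maxRetries int
}

// allowRestart records a failure at the given cursor and reports whether
// another restart attempt should be made.
func (g *restartGuard) allowRestart(cursor string) bool {
	if cursor == g.lastCursor {
		g.failures++
	} else {
		// Progress was made since the last failure, so reset the counter.
		g.lastCursor = cursor
		g.failures = 1
	}
	return g.failures <= g.maxRetries
}

func main() {
	g := &restartGuard{maxRetries: 3}
	for _, cursor := range []string{"c1", "c1", "c1", "c1", "c2"} {
		fmt.Println(cursor, g.allowRestart(cursor))
	}
}
```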

@belimawr
Contributor

This is not fixed by #40558

@belimawr belimawr reopened this Sep 13, 2024
@belimawr
Contributor

I've just managed to reproduce this issue by calling journalctl directly:

[root@aws-test-journald ~]# journalctl --utc --output=json --follow --after-cursor 's=922eeded44734fd9b2fe892ceb4ec2df;i=25a960;b=5d68f7c3ebf040879a4ab3ccbb2d965b;m=1a474efb16;t=621f2d4293733;x=c148d181bdad7f5b' > ./journal2.ndjson                                      
Failed to iterate through journal: Bad message                                                                                                                                                                                         

Both Filebeat and journalctl were able to continue reading the journal using the last known cursor, and I did not see any indication that we'd get stuck in a "Bad message" crash loop, so I'm closing this issue as solved by #40558.

djaglowski pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue Oct 28, 2024
#### Description
According to the community, there are bugs in systemd that could corrupt
the journal files or crash the log receiver:
systemd/systemd#24320
systemd/systemd#24150

We've seen some issues reported to the Elastic/beats project:
elastic/beats#39352
elastic/beats#32782
elastic/beats#34077

Unfortunately, otelcol is not immune to these issues. When the
journalctl process exits for any reason, log consumption from
journald just stops. We've experienced this on some machines that have
a high log volume. Currently we monitor the journalctl processes started
by otelcol and restart otelcol when one of them is missing. IMO,
the journald receiver itself should monitor the journalctl process it
starts and do its best to keep it alive.

In this PR, we try to restart the journalctl process when it exits
unexpectedly. As long as the journalctl cmd can be started (via
`Cmd.Start()`) successfully, the journald_input will always try to
restart the journalctl process if it exits.

The error reporting behaviour changes a bit in this PR. Before the PR,
the `operator.Start` waits up to 1 sec to capture any immediate error
returned from journalctl. After the PR, the error won't be reported back
even if journalctl exits immediately after start; instead, the error
will be logged and the process will be restarted.

The fix is largely inspired by
elastic/beats#40558.

#### Testing
Add a simple bash script that prints a line every second, and load it into
systemd.

`log_every_second.sh`:
```bash
#!/bin/bash
while true; do
    echo "Log message: $(date)"
    sleep 1
done
```

`log.service`:
```
[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

Start the otelcol with the following config:
```yaml
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]

receivers:
  journald:

exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1
```

Kill the journalctl process and observe otelcol's behaviour. The
journalctl process will be restarted after the backoff period (hardcoded
to 2 sec):
```bash
2024-10-06T14:32:33.755Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 1}
2024-10-06T14:32:34.709Z	error	journald/input.go:98	journalctl command exited	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: terminated"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.108.0/operator/input/journald/input.go:98
2024-10-06T14:32:36.712Z	debug	journald/input.go:94	Starting the journalctl command	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input"}
2024-10-06T14:32:36.756Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 10}
```


---------

Signed-off-by: Mengnan Gong <namco1992@gmail.com>
jpbarto pushed a commit to jpbarto/opentelemetry-collector-contrib that referenced this issue Oct 29, 2024