Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Device Status flapping between UpToDate and Outdated #89

Open
doanac opened this issue May 8, 2019 · 3 comments
Open

Device Status flapping between UpToDate and Outdated #89

doanac opened this issue May 8, 2019 · 3 comments

Comments

@doanac
Copy link

doanac commented May 8, 2019

Found on version: device-registry:0.3.0-12-ga1ea92b

We have at least one device in our OTA Connect deployment who's status seems to be flapping between UpToDate and Outdated.

When looking at the logs I see the following messages repeating over and over. I have a feeling something here is the cause of the device status not looking correct, but I have no idea. It almost seems like there's a message in the queue we can't process and are hitting it over and over and over again.

Any ideas how we can get unstuck?

I|2019-05-08 03:40:12,271|o.a.k.c.consumer.ConsumerConfig|ConsumerConfig values: 
	auto.commit.interval.ms = 5000
	auto.offset.reset = latest
	bootstrap.servers = [kafka:9092]
	check.crcs = true
	client.id = consumer-device-registry-DeviceUpdateEvent-ce
	connections.max.idle.ms = 540000
	enable.auto.commit = false
	exclude.internal.topics = true
	fetch.max.bytes = 52428800
	fetch.max.wait.ms = 500
	fetch.min.bytes = 1
	group.id = device-registry-DeviceUpdateEvent-ce
	heartbeat.interval.ms = 3000
	interceptor.classes = null
	internal.leave.group.on.close = true
	isolation.level = read_uncommitted
	key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
	max.partition.fetch.bytes = 1048576
	max.poll.interval.ms = 300000
	max.poll.records = 500
	metadata.max.age.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
	receive.buffer.bytes = 65536
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 305000
	retry.backoff.ms = 100
	sasl.jaas.config = null
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.mechanism = GSSAPI
	security.protocol = PLAINTEXT
	send.buffer.bytes = 131072
	session.timeout.ms = 10000
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
	ssl.endpoint.identification.algorithm = null
	ssl.key.password = null
	ssl.keymanager.algorithm = SunX509
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLS
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
	value.deserializer = class com.advancedtelematic.libats.messaging.kafka.JsonDeserializer

I|2019-05-08 03:40:12,280|o.a.kafka.common.utils.AppInfoParser|Kafka version : 0.11.0.1
I|2019-05-08 03:40:12,281|o.a.kafka.common.utils.AppInfoParser|Kafka commitId : c2a0d5f9b1f45bf5
I|2019-05-08 03:40:12,283|c.a.l.m.d.MessageBusListenerActor|Subscribed to DeviceUpdateEvent
I|2019-05-08 03:40:12,348|o.a.k.c.c.i.AbstractCoordinator|Discovered coordinator 10.32.13.4:9092 (id: 2147483647 rack: null) for group device-registry-DeviceUpdateEvent-ce.
I|2019-05-08 03:40:12,349|o.a.k.c.c.i.ConsumerCoordinator|Revoking previously assigned partitions [] for group device-registry-DeviceUpdateEvent-ce
I|2019-05-08 03:40:12,355|o.a.k.c.c.i.AbstractCoordinator|(Re-)joining group device-registry-DeviceUpdateEvent-ce
I|2019-05-08 03:40:15,360|o.a.k.c.c.i.AbstractCoordinator|Successfully joined group device-registry-DeviceUpdateEvent-ce with generation 10645
I|2019-05-08 03:40:15,363|o.a.k.c.c.i.ConsumerCoordinator|Setting newly assigned partitions [DeviceUpdateEvent-ce-0] for group device-registry-DeviceUpdateEvent-ce
W|2019-05-08 03:40:15,371|akka.kafka.KafkaConsumerActor|Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds
E|2019-05-08 03:40:15,666|akka.stream.Materializer|[DeviceUpdateEvent.listener] Upstream failed.
com.advancedtelematic.libats.http.Errors$RawError: device doesn't exist
I|2019-05-08 03:40:15,684|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(4499fe53-23b4-4599-b137-576bea3b21a8)
I|2019-05-08 03:40:15,691|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(e0da65df-b959-4f47-8131-104a3f0d0645)
I|2019-05-08 03:40:15,691|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(0aebdbcc-30fa-4ffd-8696-6dd537bfe16d)
I|2019-05-08 03:40:15,691|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(9a9773ed-a001-44d9-85cd-1d9381b826fb)
I|2019-05-08 03:40:15,691|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(aaf4be4c-a6ae-4b4e-a74a-536a8451604e)
I|2019-05-08 03:40:15,691|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(22491ab2-4c35-4d82-9d15-10b829bbaf23)
E|2019-05-08 03:40:15,687|c.a.l.m.d.MessageBusListenerActor|Source/Listener died, subscribing again
com.advancedtelematic.libats.http.Errors$RawError: device doesn't exist
I|2019-05-08 03:40:15,694|akka.actor.RepointableActorRef|Message [akka.kafka.KafkaConsumerActor$Internal$Stop$] without sender to Actor[akka://device-registry/system/kafka-consumer-11#-761300883] was not delivered. [8] dead letters encountered. If this is not an expected behavior, then [Actor[akka://device-registry/system/kafka-consumer-11#-761300883]] may have terminated unexpectedly, This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
I|2019-05-08 03:40:20,689|o.a.k.c.consumer.ConsumerConfig|ConsumerConfig values: 
	auto.commit.interval.ms = 5000
	auto.offset.reset = latest
	bootstrap.servers = [kafka:9092]
	check.crcs = true
	client.id = consumer-device-registry-DeviceUpdateEvent-ce
	connections.max.idle.ms = 540000
	enable.auto.commit = false
	exclude.internal.topics = true
	fetch.max.bytes = 52428800
	fetch.max.wait.ms = 500
	fetch.min.bytes = 1
	group.id = device-registry-DeviceUpdateEvent-ce
	heartbeat.interval.ms = 3000
	interceptor.classes = null
	internal.leave.group.on.close = true
	isolation.level = read_uncommitted
	key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
	max.partition.fetch.bytes = 1048576
	max.poll.interval.ms = 300000
	max.poll.records = 500
	metadata.max.age.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
	receive.buffer.bytes = 65536
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 305000
	retry.backoff.ms = 100
	sasl.jaas.config = null
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.mechanism = GSSAPI
	security.protocol = PLAINTEXT
	send.buffer.bytes = 131072
	session.timeout.ms = 10000
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
	ssl.endpoint.identification.algorithm = null
	ssl.key.password = null
	ssl.keymanager.algorithm = SunX509
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLS
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
	value.deserializer = class com.advancedtelematic.libats.messaging.kafka.JsonDeserializer

I|2019-05-08 03:40:20,687|c.a.l.m.d.MessageBusListenerActor|Subscribing to DeviceUpdateEvent
I|2019-05-08 03:40:20,692|c.a.l.m.d.MessageBusListenerActor|Subscribed to DeviceUpdateEvent
I|2019-05-08 03:40:20,694|o.a.kafka.common.utils.AppInfoParser|Kafka version : 0.11.0.1
I|2019-05-08 03:40:20,695|o.a.kafka.common.utils.AppInfoParser|Kafka commitId : c2a0d5f9b1f45bf5
I|2019-05-08 03:40:20,702|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(5dd56d2d-20f5-471a-a03a-93bda553122e)
I|2019-05-08 03:40:20,703|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(4122c08a-00e6-4416-b5ec-800c426f93bb)
I|2019-05-08 03:40:20,703|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(afae4b4a-ae70-4f8c-aab7-eee473894c8c)
I|2019-05-08 03:40:20,703|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(89661359-2a99-4351-9876-31aa18b23b36)
I|2019-05-08 03:40:20,703|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(8bab26ba-d89d-48ce-9160-3aab9900e5a7)
I|2019-05-08 03:40:20,703|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(1ed722c6-7b03-4258-9356-6ccac4c3ed78)
I|2019-05-08 03:40:20,757|o.a.k.c.c.i.AbstractCoordinator|Discovered coordinator 10.32.13.4:9092 (id: 2147483647 rack: null) for group device-registry-DeviceUpdateEvent-ce.
I|2019-05-08 03:40:20,758|o.a.k.c.c.i.ConsumerCoordinator|Revoking previously assigned partitions [] for group device-registry-DeviceUpdateEvent-ce
I|2019-05-08 03:40:20,759|o.a.k.c.c.i.AbstractCoordinator|(Re-)joining group device-registry-DeviceUpdateEvent-ce
I|2019-05-08 03:40:23,771|o.a.k.c.c.i.AbstractCoordinator|Successfully joined group device-registry-DeviceUpdateEvent-ce with generation 10647
I|2019-05-08 03:40:23,772|o.a.k.c.c.i.ConsumerCoordinator|Setting newly assigned partitions [DeviceUpdateEvent-ce-0] for group device-registry-DeviceUpdateEvent-ce
W|2019-05-08 03:40:23,783|akka.kafka.KafkaConsumerActor|Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds
E|2019-05-08 03:40:24,048|akka.stream.Materializer|[DeviceUpdateEvent.listener] Upstream failed.
com.advancedtelematic.libats.http.Errors$RawError: device doesn't exist
E|2019-05-08 03:40:24,061|c.a.l.m.d.MessageBusListenerActor|Source/Listener died, subscribing again
com.advancedtelematic.libats.http.Errors$RawError: device doesn't exist
I|2019-05-08 03:40:24,092|akka.actor.RepointableActorRef|Message [akka.kafka.KafkaConsumerActor$Internal$Stop$] without sender to Actor[akka://device-registry/system/kafka-consumer-12#-59468305] was not delivered. [9] dead letters encountered. If this is not an expected behavior, then [Actor[akka://device-registry/system/kafka-consumer-12#-59468305]] may have terminated unexpectedly, This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
I|2019-05-08 03:40:25,679|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(4499fe53-23b4-4599-b137-576bea3b21a8)
I|2019-05-08 03:40:25,680|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(e0da65df-b959-4f47-8131-104a3f0d0645)
I|2019-05-08 03:40:25,680|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(0aebdbcc-30fa-4ffd-8696-6dd537bfe16d)
I|2019-05-08 03:40:25,681|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(9a9773ed-a001-44d9-85cd-1d9381b826fb)
I|2019-05-08 03:40:25,682|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(6a1a5eba-a270-4ef9-b742-b9613ccb2b4d)
I|2019-05-08 03:40:25,683|c.a.l.m.d.MessageBusListenerActor$|Processed DeviceSeen - DeviceId(aaf4be4c-a6ae-4b4e-a74a-536a8451604e)
I|2019-05-08 03:40:29,065|c.a.l.m.d.MessageBusListenerActor|Subscribing to DeviceUpdateEvent
I|2019-05-08 03:40:29,069|o.a.k.c.consumer.ConsumerConfig|ConsumerConfig values: 
	auto.commit.interval.ms = 5000

@doanac
Copy link
Author

doanac commented May 8, 2019

Turning off the gateway, I can get things a little more quiet and the log looks like:

I|2019-05-08 13:03:06,265|o.a.k.c.c.i.ConsumerCoordinator|Setting newly assigned partitions [DeviceUpdateEvent-ce-0] for group device-registry-DeviceUpdateEvent-ce
E|2019-05-08 13:03:06,343|akka.stream.Materializer|[DeviceUpdateEvent.listener] Upstream failed.
com.advancedtelematic.libats.http.Errors$RawError: device doesn't exist
E|2019-05-08 13:03:06,348|c.a.l.m.d.MessageBusListenerActor|Source/Listener died, subscribing again
com.advancedtelematic.libats.http.Errors$RawError: device doesn't exist
I|2019-05-08 13:03:11,365|c.a.l.m.d.MessageBusListenerActor|Subscribing to DeviceUpdateEvent

@doanac
Copy link
Author

doanac commented May 8, 2019

Turning off all services (director, treehhub, tuf-keyserver, tuf-reposerver and the *-daemon) does not fix things, so it seems like there's something stuck in kafka?

@doanac
Copy link
Author

doanac commented May 8, 2019

out of desperation i did the following:

kafka-topics --delete --topic DeviceUpdateEvent-ce  --zookeeper zookeeper

The device-registry seems to be happy again.

doanac added a commit to foundriesio/ota-api that referenced this issue May 8, 2019
This addresses issue:

 #3

In that it prevents a device to ever get re-created, as this is not
allowed by OTA Connect. This seems like a bug in OTA Connect, but
was easiest to fix here. Additionally, we think this addresses
a bug we had in production:

 advancedtelematic/ota-device-registry#89

Signed-off-by: Andy Doan <andy@foundries.io>
doanac added a commit to foundriesio/ota-api that referenced this issue May 8, 2019
This addresses issue:

 #3

In that it prevents a device to ever get re-created, as this is not
allowed by OTA Connect. This seems like a bug in OTA Connect, but
was easiest to fix here. Additionally, we think this addresses
a bug we had in production:

 advancedtelematic/ota-device-registry#89

Signed-off-by: Andy Doan <andy@foundries.io>
doanac added a commit to foundriesio/ota-api that referenced this issue May 8, 2019
This addresses issue:

 #3

In that it prevents a device to ever get re-created, as this is not
allowed by OTA Connect. This seems like a bug in OTA Connect, but
was easiest to fix here. Additionally, we think this addresses
a bug we had in production:

 advancedtelematic/ota-device-registry#89

Signed-off-by: Andy Doan <andy@foundries.io>
doanac added a commit to foundriesio/ota-api that referenced this issue May 8, 2019
This addresses issue:

 #3

In that it prevents a device to ever get re-created, as this is not
allowed by OTA Connect. This seems like a bug in OTA Connect, but
was easiest to fix here. Additionally, we think this addresses
a bug we had in production:

 advancedtelematic/ota-device-registry#89

Signed-off-by: Andy Doan <andy@foundries.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant