Changing the Redis queue password deletes all queued events... #167
Hi @mmangione, sorry about this, I understand your frustration. I have just checked and I'm not seeing any flush that we are doing in the production code. Are you referring to a specific piece of our code?
@tsteur Hello. I can’t say for sure. I was watching the monitor when I changed the password. It went from 230,000 queued events to 0. I didn’t change the database number (or anything else), so it has to be that those events were flushed from Redis. Usually, you can just remove the keys and it will delete all data. I assumed it was an explicit command, but it might be some other mechanism that accomplishes the same thing.
Did this happen the moment you changed the password in Redis, or when you changed it in Matomo? Was the password maybe also changed somewhere else, in case Redis is used by something else too? Just double-checking to make sure it's caused by Matomo. Did you also change any other setting in Matomo Queued Tracking? When you save the settings, I can't see anything that would cause this. It only tries to connect to Redis and select the configured DB to test the entered credentials, and that's all.
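For reference, a minimal sketch of what such a credentials check amounts to with the phpredis extension (this is not the actual plugin code; host, port, timeout, DB number, and password below are placeholders):

```php
<?php
// Hypothetical credentials check: connect, authenticate, select the configured DB.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379, 0.5); // placeholder host/port/timeout
$redis->auth('newly-entered-password');  // the password just saved in the settings
$redis->select(2);                       // the configured database number
// No FLUSHDB or FLUSHALL is issued anywhere in this check; it only verifies
// that the entered credentials can reach the configured database.
```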
@tsteur We changed the password in Redis, and then the workers stopped running. We went from 16 workers to 0. It stayed like this for a couple of minutes while we logged into the system; it simply repeated the same event count over and over. When we changed the password in Matomo, the monitor paused, flashed, and the event count fell from 230,000 to 0, and all of the worker counts dropped to 0. The queues functioned normally again after the reset.
@mmangione could you also answer the other questions, like did you change any other setting in Matomo Queued Tracking when you changed the password? Was Redis maybe restarted at some point so that the data got lost? I haven't tried to reproduce it yet, but so far I can't really imagine how Matomo would drop the data there.
Hello @tsteur! It happened again and we just lost about 1.5M events over about 6 hours. Here's what happened:
As I write this, our systems are back to processing events and we are maintaining about 200 events in the queue at any given time. But we found this out accidentally, without any logs or other guidance. At this point, this is the second time we've lost a very significant amount of events because this tracking plugin decided to stop working after a DB change. The last time, we changed the DB user. This time, we changed the IP address, and somehow that caused event processing to slow to a crawl, develop a memory leak, and then start purging stored events.

We looked everywhere for logs, but we could not find any that were relevant. Whatever was happening was occurring silently and in a way that we couldn't diagnose. We are running in GCP on their MemoryStore service. I can confirm our Redis server was never restarted, nor has it been touched since the last time I had to remove AUTH to get this working again. Nor was our main DB cluster restarted or touched at all.
@tsteur Any thoughts?
@tsteur Pinging again.
Maybe there is a memory limit in the config for the command line? PHP definitely enforces a memory limit. You could try, for example, running the worker command with a higher memory_limit set for the CLI.
@mmangione are you also terminating and re-launching all the queued tracking workers after this change? I would also recommend clearing the Matomo caches after such a change. I'm personally not working on this feature anymore, so I can't really reproduce this issue easily, unfortunately.
@tsteur We diagnosed and remediated the issue. It was part of a larger one. We developed a really bad instability where the queues started randomly resetting after upgrading to 4.13.0. Our queues looked absolutely atrocious, and we were losing about 30-40% of events to these resets. Here is a graph of the queues resetting during normal activity, under light load. We enabled the Matomo logging and found this error:
The main issue turned out to be that we were using the same Redis server for both caching and the queues. We, of course, used different databases for each service. We do this with many of our apps, and we have many apps on a single Redis server without issue. In this case, there must be an overly broad flush command in either RedisCache or this QueuedTracking plugin. Once we split things up and spun up two separate Redis servers, all of our issues with configuration, resetting queues, purging queues, and every other form of instability stopped. I hope this helps someone in the future, because we spent months on this and then even purchased Premium Support... all to no avail.
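To illustrate why a single overly broad flush would explain this even though the two systems use different database numbers, here is a hypothetical phpredis snippet (the DB numbers and key names are made up for the example and are not taken from the plugin):

```php
<?php
// Two clients on the same Redis server, separated only by database number.
$cache = new Redis();
$cache->connect('127.0.0.1', 6379);
$cache->select(0);                        // caching lives in DB 0
$cache->set('cache:report', 'payload');

$queue = new Redis();
$queue->connect('127.0.0.1', 6379);
$queue->select(1);                        // queued tracking lives in DB 1
$queue->lPush('tracking:queue', 'event');

$cache->flushDb();   // FLUSHDB clears only the selected DB (0); the queue in DB 1 survives
$queue->flushAll();  // FLUSHALL clears every DB on the server, wiping the cache AND the queue
```

With two physically separate servers, a FLUSHALL issued by either system can no longer touch the other one's data, which matches the fix described above.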
Then the only thing left to do for this issue would maybe be to add some info about this to the documentation?
Thanks @mmangione, that must have been hard to find. So just to confirm: the issue got resolved by splitting the cache and the queued tracking onto two different Redis servers, and not just two different DBs? Indeed, while the cache is doing a flushdb, this plugin does a flushall, which would explain the problem.
We basically need to change https://github.com/matomo-org/plugin-QueuedTracking/blob/4.x-dev/Queue/Backend/Redis.php#L250 to use flushDb instead of flushAll.
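A sketch of what that change would presumably look like (the method body and the $this->redis property name are assumptions for illustration; see the linked file for the real code):

```php
public function flushAll()
{
    // before (assumed): clears every database on the Redis server
    // return $this->redis->flushAll();

    // after: clears only the database this backend has selected
    return $this->redis->flushDb();
}
```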
Actually, that method is only used in tests and not in production code, so this won't fix it.
@tsteur We use a single Redis server and separate our applications by using different database numbers. The credentials are the same and the auth is the same; only the database number each application selects is different.
@tsteur We suspect the issue has been going on since we first set up our instance in 2020. Our visits, visitors, and PPC campaign results were off by (+/-)20% some days (against GA). We could never quite figure out why. We did notice that our queues would hover around 200-300 actions to process, would occasionally drop to 0-10, and then climb back up immediately.

With 4.13.0, we saw a 10,000% increase in the network traffic of our Redis server. That sudden load stressed every piece of our infrastructure and showed us how many areas needed to be tuned properly. We saw the queues behaving strangely but thought it might be normal; we didn't treat it as an issue until 4.13.0 caused huge buildups and dumps of the queues. Without that sudden, significant increase in load, I think this issue wouldn't have been noticeable. It was only after we systematically addressed all of the other possible causes and the problem persisted that we thought to reach out to Matomo Support. It was actually a fortuitous off-hand remark by Richard, who was surprised that we were "using the same Redis server for both caching and queued tracking", that made us question our architecture at all. We got very lucky here. It wasn't even on our radar otherwise.

I guess a good lesson here is to stress test all production systems periodically. We have definitely learned that lesson. Hopefully, this explanation of the process of identifying this issue will be helpful to future devs. If you're reading this in the future, go stress test your setup. Send more and more traffic at it until it fails, then identify what failed and mitigate it. Do it now. If you are having any intermittent issue, this should be your strategy for finding the root cause.
@mmangione are you maybe also using Redis for sessions? I couldn't locate anything yet that would do any flush or explain this. There was one flush, but that is actually only run in tests. @heurteph-ei I'll update an FAQ for now.
@tsteur Sorry for the delay! I didn't see this until I received the notification that you had reopened the issue. We were using Redis for Matomo's caching and for queued tracking, on the same Redis server with a different database for each system. The problem went away completely after we stood up a new Redis server and edited our configurations.
We just lost 230,000 events of customer data because we changed our Redis queue password from the settings. I can't imagine why one would ever intentionally issue a FLUSH command after a setting is updated. Or, if you knew about this, why is it not in all caps on that settings card?
We have to periodically cycle passwords for security. We need to be able to do that from our Matomo instance WITHOUT having our system purge our queued events.
That has to be a bug. That behavior is extremely harmful. We are trying to use this system at the scale of 10M hits/day. It's going to take days for us to try to reconstruct the lost data that the queueing system just purged unexpectedly.