Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature COMMANDLOG to record slow execution and large request/reply #1294

Open
wants to merge 13 commits into
base: unstable
Choose a base branch
from

Conversation

soloestoy
Copy link
Member

@soloestoy soloestoy commented Nov 12, 2024

As discussed in PR #336.

We have different types of resources like CPU, memory, network, etc. The slowlog can only record commands eat lots of CPU during the processing phase (doesn't include read/write network time), but can not record commands eat too many memory and network. For example:

  1. run "SET key value(10 megabytes)" command would not be recored in slowlog, since when processing it the SET command only insert the value's pointer into db dict. But that command eats huge memory in query buffer and bandwidth from network. In this case, just 1000 tps can cause 10GB/s network flow.
  2. run "GET key" command and the key's value length is 10 megabytes. The get command can eat huge memory in output buffer and bandwidth to network.

This PR introduces a new command COMMANDLOG, to log commands that consume significant network bandwidth, including both input and output. Users can retrieve the results using COMMANDLOG get <count> large-request and COMMANDLOG get <count> large-reply, all subcommands for COMMANDLOG are:

  • COMMANDLOG HELP
  • COMMANDLOG GET <count> <slow|large-request|large-reply>
  • COMMANDLOG LEN <slow|large-request|large-reply>
  • COMMANDLOG RESET <slow|large-request|large-reply>

And the slowlog is also incorporated into the commandlog.

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Copy link

codecov bot commented Nov 12, 2024

Codecov Report

Attention: Patch coverage is 92.12598% with 10 lines in your changes missing coverage. Please review.

Project coverage is 70.77%. Comparing base (a939cb8) to head (2fe4507).

Files with missing lines Patch % Lines
src/commandlog.c 92.79% 8 Missing ⚠️
src/module.c 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1294      +/-   ##
============================================
+ Coverage     70.59%   70.77%   +0.17%     
============================================
  Files           117      117              
  Lines         63324    63356      +32     
============================================
+ Hits          44704    44839     +135     
+ Misses        18620    18517     -103     
Files with missing lines Coverage Δ
src/blocked.c 91.90% <100.00%> (ø)
src/commands.def 100.00% <ø> (ø)
src/config.c 78.83% <ø> (ø)
src/latency.c 80.87% <100.00%> (ø)
src/server.c 87.66% <100.00%> (+<0.01%) ⬆️
src/server.h 100.00% <ø> (ø)
src/module.c 9.64% <0.00%> (-0.01%) ⬇️
src/commandlog.c 92.79% <92.79%> (ø)

... and 15 files with indirect coverage changes

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
@hwware
Copy link
Member

hwware commented Nov 15, 2024

Just for Note: Let's first refresh the memory in pr #336, the last comment conclusion is:

After a core team meeting, we decided adding a new command COMMANDLOG with subcommands HEAVYTRAFFIC and SLOW, and then slowlog.c can be renamed to a common commandlog.c. #336 (comment)

@hwware
Copy link
Member

hwware commented Nov 15, 2024

From my understanding, for the command: commandlog get should be:
COMMANDLOG get count slow | heavytraffic-input | heavytraffic-output.

commandlog len should be:
COMMANDLOG len slow | heavytraffic-input | heavytraffic-output.

commandlog reset should be:
COMMANDLOG reset slow | heavytraffic-input | heavytraffic-output.

Can you describe the terms heavytraffic-input and heavytraffic-output in the json file (argument part) because I can only know
them from the source codes and valkey.conf so far?

And I think the existing slowlog commands should be deprecated? If yes, I think you should update the related json files as well.

valkey.conf Outdated Show resolved Hide resolved
valkey.conf Outdated Show resolved Hide resolved
src/commandlog.c Outdated Show resolved Hide resolved
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
@soloestoy soloestoy changed the title Feature COMMANDLOG to record slow and heavy traffic Feature COMMANDLOG to record slow execution and large request/reply Nov 20, 2024
Copy link
Member

@hwware hwware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, approved

src/config.c Outdated Show resolved Hide resolved
src/config.c Outdated Show resolved Hide resolved
valkey.conf Outdated
# unit being bytes.
large-request-larger-than 1024
# Record the number of commands.
large-request-max-len 128
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure about the default value? Isn't it quite low?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean large-request-larger-than?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 1KB a good default for both input/output? Lot's of command response can easily exceed 1KB response I believe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can discuss it further, my thought is to work backward from the results. In my view, 100MB/s (nearly 1Gbps) is a large amount of traffic. With this level of traffic, I believe the ops that valkey can provide are on the order of 100,000. Therefore, calculating it backward, a default value of 10KB seems more reasonable.

Comment on lines +3406 to +3408
commandlogPushEntryIfNeeded(c, argv, argc, c->duration, COMMANDLOG_TYPE_SLOW);
commandlogPushEntryIfNeeded(c, argv, argc, c->net_input_bytes_curr_cmd, COMMANDLOG_TYPE_LARGE_REQUEST);
commandlogPushEntryIfNeeded(c, argv, argc, c->net_output_bytes_curr_cmd, COMMANDLOG_TYPE_LARGE_REPLY);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be maintaining three different list with unique log entries? Or should we have a single list/entry with multiple type associated with it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying to put all three types of records into one list? I think they should be independent, otherwise, you would need to traverse the list to get the count of each specific type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 3 different lists are better because client maybe not care about all 3 entries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soloestoy If it's already implemented in Alibaba, do you have any anecdote from production where the same command showing up in slowlog/large-request/large-reply? My intuition is there will be quite a bit of overlap. Hence, I was thinking if we should avoid creating unique log entries for each commandlog type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, with the current implementation of valkey, there is no direct relationship between large requests/responses and slow execution, because network transmission is not included in the execution time. They need to be measured and viewed independently.

For example, the execution time of command set key value(1MB) might be only 1 microsecond, because it only needs to insert the pointer of this 1MB value into the database. The network read, memory allocation, and data copying for this 1MB are not counted in the execution time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Command get key is different, since the execution time includes copying value to output buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. I was thinking about the scan type operation. They would ideally take longer cpu cycles as well as the reply size would be large as well.

@@ -4,6 +4,11 @@
"complexity": "Depends on subcommand.",
"group": "server",
"since": "2.2.12",
"arity": -2
"arity": -2,
"deprecated_since": "8.1.0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For compatibility with other databases, I think we should be extremely careful about deprecating commands. Some users are very careful to not use deprecated commands. I remember when some product team were working on upgrading to Valkey and reading all the release notes of the 7.x versions, they noticed that CLUSTER SLOTS is deprecated, it almost caused them a lot of extra work, until I explained to them that it is no longer deprecated.

We can document that it is an alias of COMMANDLOG, but we don't have to mark it as deprecated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we need a "not-recommended" field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just describe it in the docs in text?

Just curious, why mark as not recommended? Is there a reason to make users move away from the slowlog commands?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, the new command 'COMMANDLOG get count slow | heavytraffic-input | heavytraffic-output.' is encouraged in new version, thus the command 'Slowlog' is not suggested to be used anymore.

I think it makes no sense that 2 commands with the same weight implement the same function in a system, they are duplicated.

Copy link
Contributor

@zuiderkwast zuiderkwast Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so there are pros and cons:

  • Avoid duplicate/alias commands. Of course this is good to avoid.
  • Keep compatibility with other systems (forks and managed databases) and tools written for them. This is a good thing too.

You say keep compatibility with other forks and tools makes no sense? Or you say we keep SLOWLOG for backward compatibility but it's recommended to use COMMANDLOG instead?

This is fine. I'm just concerned that if we say "deprecated", users will think that we want to remove it and diverge from redis and other forks. This is a bad communication to the community IMO.

I have seen some comments that people are worried that Valkey will diverge from open source Redis versions. If Redis and Valkey diverges, I hope we can safely say it is Redis fault, not our fault. Did you see the recent discussion about the redis-rs Rust library?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so there are pros and cons:

  • Avoid duplicate/alias commands. Of course this is good to avoid.
  • Keep compatibility with other systems (forks and managed databases) and tools written for them. This is a good thing too.

You say keep compatibility with other forks and tools makes no sense? Or you say we keep SLOWLOG for backward compatibility but it's recommended to use COMMANDLOG instead?

First all, I do not express making no sense to compatible with other forks and tools. I think Valkey should try the best compatible with others. And I would suggest in the new version, user had better choose the COMMANDLOG command instead of SLOWLOG command.

This is fine. I'm just concerned that if we say "deprecated", users will think that we want to remove it and diverge from redis and other forks. This is a bad communication to the community IMO.

Of course, from my personal experience, if I see API noted deprecated, I would like to avoid to use them. Maybe we can find the other words to encourage to use COMMANDLOG command

I have seen some comments that people are worried that Valkey will diverge from open source Redis versions. If Redis and Valkey diverges, I hope we can safely say it is Redis fault, not our fault. Did you see the recent discussion about the redis-rs Rust library?

Yes, I just saw the lots of comments from redis-rs discussion yesterday. But for deprecating an existing command in Redis OSS, I think we could find a better solution to let our user know to deal with it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First all, I do not express making no sense to compatible with other forks and tools.

Sorry! I know you don't.

I think we want the same thing. I just wanted to highlight that there are pros and cons. Maybe I'm overrreacting. Maybe deprecated is right here.

Whatever we call it, deprecated or not recommended, we can describe in the text that there is no plan to remove the slowlog command. It can make worried people less worried maybe.

Copy link
Member

@hwware hwware Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First all, I do not express making no sense to compatible with other forks and tools.

Sorry! I know you don't.

Do not worry, I know you ^_^

I think we want the same thing. I just wanted to highlight that there are pros and cons. Maybe I'm overrreacting. Maybe deprecated is right here.

Whatever we call it, deprecated or not recommended, we can describe in the text that there is no plan to remove the slowlog command. It can make worried people less worried maybe.

In fact, 'not recommended' is much properly than deprecated if we can add it in json file. @soloestoy can we? haha,
I never did it before

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, "not-recommended" or "not-recommended-since" field is not bad, haha. We can keep replaced-by too.

It has no effect in any tools, but we can use it for the website. We just need to implement some code for it in website repo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with "not-recommended", I think we can open another PR to implement and discuss which commands should be changed from "deprecated" to "no-recommended".

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

5 participants