Proxy release 2024-05-23 #7853

vipvap · 2024-05-23T06:01:59Z

Proxy release 2024-05-23

Please merge this Pull Request using 'Create a merge commit' button

Adds ordering asserts to the output of the delta key iterator `MergeDeltaKeys` that implements a k-merge. Part of #7296 : the asserts added by this PR get hit in the reproducers of #7296 as well, but they are earlier in the pipeline.

Improves the tiered compaction tests: * Adds a new test that is a simpler version of the ignored `test_many_updates_for_single_key` test. * Reduces the amount of data that `test_many_updates_for_single_key` processes to make it execute more quickly. * Adds logging support.

## Problem The main point of this PR is to get rid of `python-jose` and `ecdsa` packages as transitive dependencies through `moto`. They have a bunch of open vulnerabilities[1][2][3] (which don't affect us directly), but it's nice not to have them at all. - [1] GHSA-wj6h-64fc-37mp - [2] GHSA-6c5p-j8vq-pqhj - [3] GHSA-cjwg-qfpm-7377 ## Summary of changes - Update `moto` from 4.1.2 to 5.0.6 - Update code to accommodate breaking changes in `moto_server`

…ServerConf` (#7642) Before this PR, `neon_local` would store a copy of a subset of the initial `pageserver.toml` in its `.neon/config`, e.g, `listen_pg_addr`. That copy is represented as `struct PageServerConf`. This copy was used to inform e.g., `neon_local endpoint` and other commands that depend on Pageserver about which port to connect to. The problem with that scheme is that the duplicated information in `.neon/config` can get stale if `pageserver.toml` is changed. This PR fixes that by eliminating populating `struct PageServerConf` from the `pageserver.toml`s. The `[[pageservers]]` TOML table in the `.neon/config` is obsolete. As of this PR, `neon_local` will fail to start and print an error informing about this change. Code-level changes: - Remove the `--pg-version` flag, it was only used for some checks during `neon_local init` - Remove the warn-but-continue behavior for when auth key creation fails but auth keys are not required. It's just complexity that is unjustified for a tool like `neon_local`. - Introduce a type-system-level distinction between the runtime state and the two (!) toml formats that are almost the same but not quite. - runtime state: `struct PageServerConf`, now without `serde` derives - toml format 1: the state in `.neon/config` => `struct OnDiskState` - toml format 2: the `neon_local init --config TMPFILE` that, unlike `struct OnDiskState`, allows specifying `pageservers` - Remove `[[pageservers]]` from the `struct OnDiskState` and load the data from the individual `pageserver.toml`s instead.

Fixes flaky test `test_gc_of_remote_layers`, which was failing because of the `Nothing to GC` pageserver log. I looked into the fails, it seems that backround `gc_loop` sometimes started GC for initial tenant, which wasn't configured to disable GC. The fix is to not create initial tenant with enabled gc at all. Fixes #7538

## Problem We currently have no way to see what the current LSN of a compute its, and in case of read replicas, we don't know what the difference in LSNs is. ## Summary of changes Adds these metrics

The test utils should only be used during tests. Users should not be able to create this extension on their own. Signed-off-by: Alex Chi Z <chi@neon.tech>

- Rename "filename" types which no longer map directly to a filename (LayerFileName -> LayerName) - Add a -v1- part to local layer paths to smooth the path to future updates (we anticipate a -v2- that uses checksums later) - Rename methods that refer to the string-ized version of a LayerName to no longer be called "filename" - Refactor reconcile() function to use a LocalLayerFileMetadata type that includes the local path, rather than carrying local path separately in a tuple and unwrap()'ing it later.

## Problem See #6714, #6967 ## Summary of changes Completely ignore page header when comparing VM pages. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>

For aux file keys (v1 or v2) the vectored read path does not return an error when they're missing. Instead they are omitted from the resulting btree (this is a requirement, not a bug). Skip updating the metric in these cases to avoid infinite results.

…(updates AWS SDKs) (#7664) Before this PR, using the AWS SDK profile feature for running against minio didn't work because * our SDK versions were too old and didn't include awslabs/aws-sdk-rust#1060 and * we didn't massage the s3 client config builder correctly. This PR * udpates all the AWS SDKs we use to, respectively, the latest version I could find on crates.io (Is there a better process?) * changes the way remote_storage constructs the S3 client, and * documents how to run the test suite against real S3 & local minio. Regarding the changes to `remote_storage`: if one reads the SDK docs, it is clear that the recommended way is to use `aws_config::from_env`, then customize. What we were doing instead is to use the `aws_sdk_s3` builder directly. To get the `local-minio` in the added docs working, I needed to update both the SDKs and make the changes to the `remote_storage`. See the commit history in this PR for details. Refs: * byproduct: smithy-lang/smithy-rs#3633 * follow-up on deprecation: #7665 * follow-up for scrubber S3 setup: #7667

## Problem Various performance test cases were destabilized by the recent upgrade of `reqwest`, because it changes an error string. Examples: - https://neon-github-public-dev.s3.amazonaws.com/reports/main/9005532594/index.html#testresult/3f984e471a9029a5/ - https://neon-github-public-dev.s3.amazonaws.com/reports/main/9005532594/index.html#testresult/8bd0f095fe0402b7/ The performance tests suffer from this more than most tests, because they churn enough data that the pageserver is still trying to contact the storage controller while it is shut down at the end of tests. ## Summary of changes s/Connection refused/error sending request/

…cheduling optimization (#7673) ## Problem Storage controller was using a zero layer count in SecondaryProgress as a proxy for "not initialized". However, in tenants with zero timelines (a legitimate state), the layer count remains zero forever. This caused #7583 to destabilize the storage controller scale test, which creates lots of tenants, some of which don't get any timelines. ## Summary of changes - Use a None mtime instead of zero layer count to determine if a SecondaryProgress should be ignored. - Adjust the test to use a shorter heatmap upload period to let it proceed faster while waiting for scheduling optimizations to complete.

This PR does two things: First, it fixes a bug with tiered compaction's k-merge implementation. It ignored the lsn of a key during ordering, so multiple updates of the same key could be read in arbitrary order, say from different layers. For example there is layers `[(a, 2),(b, 3)]` and `[(a, 1),(c, 2)]` in the heap, they might return `(a,2)` and `(a,1)`. Ultimately, this change wasn't enough to fix the ordering issues in #7296, in other words there is likely still bugs in the k-merge. So as the second thing, we switch away from the k-merge to an in-memory based one, similar to #4839, but leave the code around to be improved and maybe switched to later on. Part of #7296

…7679) This reverts commit 1173ee6. ## Problem It breaks autoscaling tests

…ts/csharp/npgsql (#7680)

## Problem #7637 breaks forward compat test. On commit ea531d4. https://neon-github-public-dev.s3.amazonaws.com/reports/main/8988324349/index.html ``` test_create_snapshot 2024-05-07T16:03:11.331883Z INFO version: git-env:ea531d448eb65c4f58abb9ef7d8cd461952f7c5f failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 16:03:11.316131763 UTC build_tag: build_tag-env:5159 test_forward_compatibility 2024-05-07T16:07:02.310769Z INFO version: git-env:ea531d448eb65c4f58abb9ef7d8cd461952f7c5f failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 16:07:02.294676183 UTC build_tag: build_tag-env:5159 ``` The forward compatibility test is actually using the same tag as the current build. The commit before that, https://neon-github-public-dev.s3.amazonaws.com/reports/main/8988126011/index.html ``` test_create_snapshot 2024-05-07T15:47:21.900796Z INFO version: git-env:2dbd1c1ed5cd0458933e8ffd40a9c0a5f4d610b8 failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 15:47:21.882784185 UTC build_tag: build_tag-env:5158 test_forward_compatibility 2024-05-07T15:50:48.828733Z INFO version: git-env:c4d7d5982553d2cf66634d1fbf85d95ef44a6524 failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 15:50:48.816635176 UTC build_tag: build_tag-env:release-5434 ``` This pull request patches the bin path so that the new neon_local will use the old binary. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem There is no global per-ep rate limiter in proxy. ## Summary of changes * Return global per-ep rate limiter back. * Rename weak compute rate limiter (the cli flags were not used anywhere, so it's safe to rename).

## Problem This caused a variation of the stats bug fixed by #7662. That PR also fixed this case, but we still shouldn't make redundant get calls. ## Summary of changes - Only call get in the create image layers loop at the end of a range if some keys have been accumulated

## Problem Move from aws based arm64 runners to bare-metal based ## Summary of changes Changes in GitHub action workflows where `runs-on: arm64` used. More parallelism added, build time for `neon with extra platform builds` workflow reduced from 45m to 25m

We didn't check permission in `"/v1/failpoints"` endpoint, it means that everyone with per-tenant token could modify the failpoints. This commit fixes that.

We had accidentally left two endpoints for `tenant`: `/synthetic_size` and `/size`. Size had the more extensive description but has returned 404 since renaming. Remove the `/size` in favor of the working one and describe the `text/html` output.

in addition to layer names, expand the input vocabulary to recognize lines in the form of: ${kind}:${lsn} where: - kind in `gc_cutoff` or `branch` - lsn is accepted in Lsn display format (x/y) or hex (as used in layer names) gc_cutoff and branch have different colors.

While switching to use nextest with the repository in f28bdb6, we had not noticed that it doesn't yet support running doctests. Run the doc tests before other tests.

@hlinnaka

The old test based on the immutable `target_file_size` that was a parameter to the function. It makes no sense to go further once `current_level_target_height` has reached `u64::MAX`, as lsn's are u64 typed. In practice, we should only run into this if there is a bug, as the practical lsn range usually ends much earlier. Testing on `target_file_size` makes less sense, it basically implements an invocation mode that turns off the looping and only runs one iteration of it. @hlinnaka agrees that `current_level_target_height` is better here. Part of #7554

Split checkpoint_stats into two separate metrics: checkpoints_req and checkpoints_timed Fixes commit 21e1a49 --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech>

The first implementation #7456 did not include `index_part.json` changes in an attempt to keep amount of changes down. Tracks the historic reparentings and earlier detach in `index_part.json`. - `index_part.json` receives a new field `lineage: Lineage` - `Lineage` is queried through RemoteTimelineClient during basebackup, creating `PREV LSN: none` for the invalid prev record lsn just as it would had been created for a newly created timeline - as `struct IndexPart` grew, it is now boxed in places Cc: #6994

In timeline detach ancestor tests there is no way to really be sure that there were no subtle off-by one bugs. One such bug is demoed and reverted. Add verifying fullbackup is equal before and after detaching ancestor. Fullbackup is expected to be equal apart from `zenith.signal`, which is known to be good because endpoint can be started without the detached branch receiving writes.

The test has been flaky since 2024-04-11 for unknown reason, and the logging was off. Fix the logging and raise the limit a bit. The problematic ratio reproduces with pg14 and added sleep (not included) but not on pg15. The new ratio abs diff limit works for all inspected examples. Cc: #7536

Per [evidence] the timeline ancestor detach tests can panic while shutting down on vectored get validation. Allow the error because tenant is restarted twice in the test. [evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7708/9058185709/index.html#suites/a1c2be32556270764423c495fad75d47/d444f7e5c0a18ce9

Once all the computes in production have restarted, we can remove protocol version 1 altogether. See issue #6211. This was done earlier already in commit 0115fe6, but reverted before it was released to production in commit bbe730d because of issue #7692. That issue was fixed in commit 22afaea, so we are ready to change the default again.

## Problem If an existing user already has some aux v1 files, we don't want to switch them to the global tenant-level config. Part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem Despite making password hashing async, it can still take time away from the network code. ## Summary of changes Introduce a custom threadpool, inspired by rayon. Features: ### Fairness Each task is tagged with it's endpoint ID. The more times we have seen the endpoint, the more likely we are to skip the task if it comes up in the queue. This is using a min-count-sketch estimator for the number of times we have seen the endpoint, resetting it every 1000+ steps. Since tasks are immediately rescheduled if they do not complete, the worker could get stuck in a "always work available loop". To combat this, we check the global queue every 61 steps to ensure all tasks quickly get a worker assigned to them. ### Balanced Using crossbeam_deque, like rayon does, we have workstealing out of the box. I've tested it a fair amount and it seems to balance the workload accordingly

part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem This test relied on some sleeps, and was failing ~5% of the time. ## Summary of changes Use log-watching rather than straight waits, and make timeouts more generous for the CI environment.

## Problem Failures on some of our uglier shutdown log messages: https://neon-github-public-dev.s3.amazonaws.com/reports/main/9192662995/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/51b365408678c66f/ ## Summary of changes - Allow-list these errors.

For existing users, we want to allow doing a force switch for their aux file policy. Part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

Signed-off-by: Alex Chi Z <chi@neon.tech>

One change: runner: allow coredump collection (#931)

github-actions · 2024-05-23T06:49:23Z

3108 tests run: 2981 passed, 0 failed, 127 skipped (full report)

Flaky tests (3)

Postgres 16

test_vm_bit_clear_on_heap_lock: debug

Postgres 15

test_timeline_deletion_with_files_stuck_in_upload_queue: debug

Postgres 14

test_timeline_deletion_with_files_stuck_in_upload_queue: debug

Code coverage* (full report)

functions: 31.4% (6449 of 20544 functions)
lines: 48.3% (49864 of 103254 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
03c2c56 at 2024-05-23T10:35:26.179Z :recycle:}

conradludgate · 2024-05-23T08:53:51Z

2 week release, so for ease of review I've collected all the proxy changes in this list

## Problem If the parquet upload was unsuccessful, it will panic. ## Summary of changes Write error in logs instead.

danieltprice · 2024-05-23T22:36:57Z

Reviewed for changelog.

arpad-m and others added 30 commits May 8, 2024 13:22

Expose LSN and replication delay as metrics (#7610)

21e1a49

## Problem We currently have no way to see what the current LSN of a compute its, and in case of read replicas, we don't know what the difference in LSNs is. ## Summary of changes Adds these metrics

chore(neon_test_utils): restrict installation to superuser (#7624)

1173ee6

The test utils should only be used during tests. Users should not be able to create this extension on their own. Signed-off-by: Alex Chi Z <chi@neon.tech>

Revert "chore(neon_test_utils): restrict installation to superuser" (#…

2682e02

…7679) This reverts commit 1173ee6. ## Problem It breaks autoscaling tests

build(deps): bump Npgsql from 8.0.2 to 8.0.3 in /test_runner/pg_clien…

5ea117c

…ts/csharp/npgsql (#7680)

Proxy added per ep rate limiter (#7636)

be1a88e

## Problem There is no global per-ep rate limiter in proxy. ## Summary of changes * Return global per-ep rate limiter back. * Rename weak compute rate limiter (the cli flags were not used anywhere, so it's safe to rename).

Fix permissions for safekeeper failpoints (#7669)

0b02043

We didn't check permission in `"/v1/failpoints"` endpoint, it means that everyone with per-tenant token could modify the failpoints. This commit fixes that.

build: run doctests (#7697)

6206f76

While switching to use nextest with the repository in f28bdb6, we had not noticed that it doesn't yet support running doctests. Run the doc tests before other tests.

Fix checkpoint metric (#7701)

95098c3

Split checkpoint_stats into two separate metrics: checkpoints_req and checkpoints_timed Fixes commit 21e1a49 --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech>

hlinnaka and others added 9 commits May 22, 2024 18:24

feat(pageserver): auto-detect previous aux file policy (#7841)

64577cf

## Problem If an existing user already has some aux v1 files, we don't want to switch them to the global tenant-level config. Part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

chore(pageserver): use kebab case for aux file flag (#7840)

ddd8ebd

part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

tests: refine test_secondary_background_downloads (#7829)

014f822

## Problem This test relied on some sleeps, and was failing ~5% of the time. ## Summary of changes Use log-watching rather than straight waits, and make timeouts more generous for the CI environment.

chore(pageserver): add force aux file policy switch handler (#7842)

4a278cc

For existing users, we want to allow doing a force switch for their aux file policy. Part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

chore(pageserver): use kebab case for compaction algorithms (#7845)

ff560a1

Signed-off-by: Alex Chi Z <chi@neon.tech>

Bump vm-builder v0.28.1 -> v0.29.3 (#7849)

eb0c026

One change: runner: allow coredump collection (#931)

vipvap requested review from a team as code owners May 23, 2024 06:02

vipvap requested review from hlinnaka, petuhovskiy, khanova and ololobus and removed request for a team May 23, 2024 06:02

[proxy] Do not fail after parquet upload error (#7858)

03c2c56

## Problem If the parquet upload was unsuccessful, it will panic. ## Summary of changes Write error in logs instead.

khanova enabled auto-merge May 23, 2024 09:55

conradludgate approved these changes May 23, 2024

View reviewed changes

khanova merged commit 7cf0f6b into release-proxy May 23, 2024
48 of 51 checks passed

khanova deleted the rc/proxy/2024-05-23 branch May 23, 2024 10:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proxy release 2024-05-23 #7853

Proxy release 2024-05-23 #7853

vipvap commented May 23, 2024

github-actions bot commented May 23, 2024 •

edited

Loading

Postgres 16

Postgres 15

Postgres 14

conradludgate commented May 23, 2024 •

edited by khanova

Loading

danieltprice commented May 23, 2024

Proxy release 2024-05-23 #7853

Proxy release 2024-05-23 #7853

Conversation

vipvap commented May 23, 2024